How can I search elements hidden by tab toggling with BeautifulSoup?



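I am trying to extract and print the text content of all td tags from a URL whose page has multiple tabs; clicking a tab shows certain elements of the page and hides the content of all the other tabs (https://www.encodeproject.org/experiments/ENCSR000EEC/). Specifically, I am trying to extract all td tags from the "File details" tab (the full list of tabs, seen in the middle of the page, is "Genome browser", "Association graph" and "File details"). Currently, the only td tags I am able to extract come from the section above the div containing the tabs, which also has td tags. The only tab that contains td tags is "File details". How can I access the hidden content in the "File details" tab? Current code: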

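import requests
from bs4 import BeautifulSoup

def test_select_files(url):
    texts = []
    # Fetch the page and parse the static HTML returned by the server
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    # Collect every td tag present in that HTML and print its text
    td_tags = soup.findAll('td')
    for tag in td_tags:
        print(tag.text.strip())

test_select_files('https://www.encodeproject.org/experiments/ENCSR000EEC/')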

Expected output (taken directly from the URL):

bed narrowPeak  
bigBed narrowPeak   
bigWig  
bed idr_ranked_peak 
bed narrowPeak  
...

You should be able to get all the information you need from the JSON embedded in the returned HTML:

from bs4 import BeautifulSoup
import requests
import json

r = requests.get('https://www.encodeproject.org/experiments/ENCSR000EEC/')
soup = BeautifulSoup(r.content, 'html.parser')
# The page data is embedded in a <script type="application/json"> tag
json_data = soup.find('script', type='application/json').string
data = json.loads(json_data)
# Each entry in data['files'] describes one file of the experiment
for file in data['files']:
    print(f"{file['accession']}  {file['file_format']:10}  {file['output_type']}")

This gives the following output:

ENCFF000XTK  bam         alignments
ENCFF000XTL  bam         alignments
ENCFF000XTM  bigBed      peaks
ENCFF000XTP  bigWig      signal
ENCFF000XTZ  fastq       reads
ENCFF000XUA  fastq       reads
ENCFF001VKJ  bed         peaks
ENCFF002CUG  bed         optimal IDR thresholded peaks
ENCFF715UNN  bigBed      optimal IDR thresholded peaks
ENCFF836BQL  bam         unfiltered alignments

I suggest you print(data) to see what other information is available for each file.
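
Printing the whole structure can be hard to read, so one option is to list which keys appear in the file records. The following is only a minimal sketch: `count_keys` is a hypothetical helper, and `sample_files` is a hand-written stand-in for `data['files']` (the first three keys come from the code above; `status` is an assumed extra key added purely for illustration).

```python
from collections import Counter


def count_keys(files):
    """Count how many file records contain each key."""
    counts = Counter()
    for record in files:
        counts.update(record.keys())
    return counts


# Illustrative stand-in for data['files']; "status" is an assumed key,
# not something confirmed by the page itself.
sample_files = [
    {"accession": "ENCFF000XTK", "file_format": "bam", "output_type": "alignments"},
    {"accession": "ENCFF000XTM", "file_format": "bigBed", "output_type": "peaks",
     "status": "released"},
]

key_counts = count_keys(sample_files)
for key, n in sorted(key_counts.items()):
    print(f"{key:12} present in {n} of {len(sample_files)} records")
```

With the real page you would pass `data['files']` instead of `sample_files`; keys that do not appear in every record are the ones to read with `file.get(key)` rather than `file[key]`.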
