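I am trying to extract and print the text content of all td tags from a URL whose page has multiple tabs. Clicking a tab reveals certain elements of the page and hides the contents of all the other tabs (https://www.encodeproject.org/experiments/ENCSR000EEC/). Specifically, I am trying to extract all td tags from the "File details" tab (the full list of tabs, seen in the middle of the page, is: "Genome browser", "Association graph", and "File details"). Currently, the only td tags I am able to extract come from the section above the div holding the tabs, which also contains td tags. The only tab that contains td tags is "File details". How do I access the hidden content in the "File details" tab?
Current code: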
import requests
from bs4 import BeautifulSoup

def test_select_files(url):
    texts = []
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    # Currently this only finds the td tags in the section above the tabs
    td_tags = soup.findAll('td')
    for tag in td_tags:
        print(tag.text.strip())

test_select_files('https://www.encodeproject.org/experiments/ENCSR000EEC/')
Expected output (taken directly from the URL):
bed narrowPeak
bigBed narrowPeak
bigWig
bed idr_ranked_peak
bed narrowPeak
...
You should be able to get all the information you need from the JSON that is returned inside the HTML:
from bs4 import BeautifulSoup
import requests
import json

r = requests.get('https://www.encodeproject.org/experiments/ENCSR000EEC/')
soup = BeautifulSoup(r.content, 'html.parser')
# The page data is embedded as JSON inside a <script type="application/json"> tag
json_data = soup.find('script', type='application/json').string
data = json.loads(json_data)

for file in data['files']:
    print(f"{file['accession']} {file['file_format']:10} {file['output_type']}")
This gives the following output:
ENCFF000XTK bam        alignments
ENCFF000XTL bam        alignments
ENCFF000XTM bigBed     peaks
ENCFF000XTP bigWig     signal
ENCFF000XTZ fastq      reads
ENCFF000XUA fastq      reads
ENCFF001VKJ bed        peaks
ENCFF002CUG bed        optimal IDR thresholded peaks
ENCFF715UNN bigBed     optimal IDR thresholded peaks
ENCFF836BQL bam        unfiltered alignments
I suggest you print(data) to see what other information is available for each file.
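If you want to try the idea without hitting the site, here is a minimal offline sketch of the same technique. Note that SAMPLE_HTML is a made-up miniature page (the real one embeds a much larger JSON object), and extract_embedded_json and file_rows are hypothetical helper names of my own, not anything provided by ENCODE or BeautifulSoup. It assumes the data sits in the first script tag of type application/json, as in the code above.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical miniature page; the real ENCODE page embeds far more data.
SAMPLE_HTML = """
<html><body>
<script type="application/json">
{"accession": "ENCSR000EEC",
 "files": [
   {"accession": "ENCFF000XTK", "file_format": "bam", "output_type": "alignments"},
   {"accession": "ENCFF000XTM", "file_format": "bigBed", "output_type": "peaks"}
 ]}
</script>
<div class="tab">content filled in later by JavaScript</div>
</body></html>
"""

def extract_embedded_json(html):
    """Parse the first <script type="application/json"> tag; return None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    script = soup.find("script", type="application/json")
    if script is None or not script.string:
        return None
    return json.loads(script.string)

def file_rows(data):
    """Collect (accession, file_format, output_type) tuples, tolerating missing keys."""
    return [
        (f.get("accession"), f.get("file_format"), f.get("output_type"))
        for f in data.get("files", [])
    ]

data = extract_embedded_json(SAMPLE_HTML)
rows = file_rows(data)
for accession, file_format, output_type in rows:
    print(f"{accession} {file_format:10} {output_type}")

# List the keys of one file record to see which fields are available.
print(sorted(data["files"][0].keys()))
```

Checking for a missing script tag and using .get() means a page without embedded JSON, or a file record lacking a field, gives you None instead of an exception. Listing the keys of a single record is usually easier to read than printing the whole data object.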