Beautiful Soup: parsing multiple tags



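I'm working with data from my school's grading system, and I'm trying to figure out how to extract the data by category.

Here is the raw HTML: https://pastebin.com/icbaemd7

So far, I have written this Python script: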

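from bs4 import BeautifulSoup

# driver is created earlier (presumably a Selenium WebDriver)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
chemData = soup.find_all('td')
content = []
print(chemData)
print("")
for i in chemData:
    # getText() already drops the markup; the split is kept from the original
    content.append(i.getText().split('</td')[0])
for k in content:
    print(k)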

It returns this result:

Safety Contract Signed
1/1
8/13/2019
Student Profile Sheet Turned In
1/1
8/13/2019
Polyatomic Ion Quiz
10/10
8/19/2019
HW Quiz Ch. 3 Target 6
3/3
8/27/2019
HW Quiz (Ch. 3 Targets 1-6)
12/16
8/28/2019
Chapters 1 & 2 Formative Quiz
15/17
8/21/2019
Chapter 3 Formative Quiz
23.5/25
9/5/2019
Lab Report: Antifreeze Lab
10/10
8/21/2019
Types of Reactions Lab Report
11/12
8/23/2019
Hydrate Lab Report
10/10
8/29/2019
Lab Assessment - Types of Reactions Lab
10/15
8/26/2019
Lab Assessment: Hydrate Lab
10/10
9/3/2019

However, I want to sort them into the categories that exist in the HTML. If I run the same script with h3 instead of td, I get the categories:

Homework
Formative Quizzes
Lab Reports
Lab Assessments

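So my question is: how can I get it to sort the actual assignments into their respective categories automatically?

Any help would be appreciated. Thanks!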

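Try something like the following: test for h3 and make its text a dictionary key; otherwise, append the values from the row under the current dictionary key.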

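from bs4 import BeautifulSoup as bs

html = '''yourHTML'''  # paste the page HTML here
soup = bs(html, 'lxml')
results = {}
# select() returns matches in document order, so each tr follows its h3
for i in soup.select('h3, tr'):
    if i.name == 'h3':
        # a heading starts a new category
        header = i.text
        results[header] = []
    else:
        # a row is stored under the most recent heading
        results[header].append(' '.join([n.text for n in i.select('td')]))
print(results)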
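To make the idea above concrete, here is a small self-contained sketch. The sample markup is invented for illustration (the real page may nest things differently), and the names `grouped` and `current` are mine. It also skips rows that appear before any h3 and rows with no td cells, such as a header row.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page may be structured differently.
html = """
<h3>Homework</h3>
<table>
  <tr><th>Assignment</th><th>Score</th><th>Date</th></tr>
  <tr><td>Polyatomic Ion Quiz</td><td>10/10</td><td>8/19/2019</td></tr>
  <tr><td>HW Quiz Ch. 3 Target 6</td><td>3/3</td><td>8/27/2019</td></tr>
</table>
<h3>Lab Reports</h3>
<table>
  <tr><td>Hydrate Lab Report</td><td>10/10</td><td>8/29/2019</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
grouped = {}
current = None  # no category seen yet

# select() yields h3 and tr elements in document order
for tag in soup.select("h3, tr"):
    if tag.name == "h3":
        current = tag.get_text(strip=True)
        grouped[current] = []
    elif current is not None:
        cells = [td.get_text(strip=True) for td in tag.find_all("td")]
        if cells:  # skip rows that only contain th cells
            grouped[current].append(cells)

for category, rows in grouped.items():
    print(category)
    for row in rows:
        print("  ", " | ".join(row))
```

Keeping each row as a list of cells, rather than one joined string, makes it easier to pull out the score or the date later.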

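Your HTML does not render correctly. As a quick solution, though, look for the parent container that holds both the h3 tag and the table for each category, and grab that parent container first. For example, suppose the h3 tag and the table sit under a div. Grab the div tags first, i.e. d = soup.find_all('div'). Then iterate over d to extract the h3 tag, and then the tr/td, e.g. d[0].find_all('h3'), d[0].find_all('td'), and so on.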
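A minimal sketch of that parent-container approach follows. It assumes each category's h3 and table share one div, which is only a guess about the page layout; the sample markup and the name `by_category` are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: assumes every category is wrapped in its own div.
html = """
<div><h3>Formative Quizzes</h3><table>
<tr><td>Chapter 3 Formative Quiz</td><td>23.5/25</td><td>9/5/2019</td></tr>
</table></div>
<div><h3>Lab Assessments</h3><table>
<tr><td>Lab Assessment: Hydrate Lab</td><td>10/10</td><td>9/3/2019</td></tr>
</table></div>
"""

soup = BeautifulSoup(html, "html.parser")
by_category = {}

for d in soup.find_all("div"):
    h3 = d.find("h3")
    if h3 is None:
        continue  # this div is not a category container
    rows = []
    for tr in d.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
    by_category[h3.get_text(strip=True)] = rows

print(by_category)
```

If the real containers are nested divs, find_all('div') would also return the outer ones, so a more specific selector (a class or id, if the page has one) would be safer.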
