Parsing paragraphs (<p>) that lie between paragraphs with a certain class (<p class=" ">) with Beautiful Soup



I am parsing a web page that contains a lot of text. The page is laid out like this:

January
1
text text text
even more text
2
text text text
even more text

I want to parse this page into a list of dictionaries like the following:

[{'month': 'January', 'day': '1', 'text':'sample text for January 1st'},
{'month': 'January', 'day': '2', 'text':'text for January 2nd'},
{'month': 'January', 'day': '3', 'text':'January 3rd'},
...]

This is the HTML view of the page:

index.html

<div id="january">
<h2><span>January</span></h2>
<p class="subtitle centerline">1</p>
<p>sample text for January 1st</p>
<p>even more sample text</p>
<p class="subtitle centerline">2</p>
<p>sample text for January 2nd</p>
<p>different sample text</p>
<p class="right"><em>John Smith</em></p>
<p class="subtitle centerline">3</p>
...
</div>

I managed to write the first part of the parsing script, which gets the month and the day.

scrape.py

data = []
day_dict = {}
months = ['january', 'february', ...]
for month in months:
    month_block = soup.find(id=month)
    month_name = month_block.find('h2').string
    days = []
    for i in month_block.find_all(class_="subtitle"):
        day = i.string.strip()
        day_dict['month'] = month_name
        day_dict['day'] = day
        data.append(day_dict.copy())

This produces the following list of dictionaries:

[{'month': 'January', 'day': '1'},
{'month': 'January', 'day': '2'},
{'month': 'January', 'day': '3'},
...]

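Since the <p> tags containing the sample text are not children of the day tags, I cannot simply select the paragraphs that belong to each day.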

Question

Is there a way to get only the text located between two tags that have the same class? For example (in pseudocode):

for i in month_block.find_all(class_="subtitle"):
    day = i.string.strip()
    text = month_block.find_all("p").after(day).before(day + 1)  # new line in pseudocode
    day_dict['month'] = month_name
    day_dict['day'] = day
    day_dict['text'] = text  # new line
    data.append(day_dict.copy())

Please excuse the rough pseudocode. Let me know if you would like more details or explanation.

Thank you for taking the time to read this.

Try this:

import bs4
calendar = {}
text = """<div id="january">
<h2><span>January</span></h2>
<p class="subtitle centerline">1</p>
<p>sample text for January 1st</p>
<p>even more sample text</p>
<p class="subtitle centerline">2</p>
<p>sample text for January 2nd</p>
<p>different sample text</p>
<p class="right"><em>John Smith</em></p>
<p class="subtitle centerline">3</p>
</div>"""
soup = bs4.BeautifulSoup(text, "html.parser")
for month_div in soup.children:
    month = month_div.find('h2').string
    calendar[month] = {}
    for entry in month_div.find_all(class_="subtitle"):
        day = entry.string.strip()
        events = []
        s = entry
        while True:
            s = s.find_next_sibling()
            if s and "subtitle" not in s.attrs.get("class", []):
                events.append(s.string.strip())
            else:
                break
        calendar[month][day] = events
print(calendar)

Output:

{'January': {'1': ['sample text for January 1st', 'even more sample text'], '2': ['sample text for January 2nd', 'different sample text', 'John Smith'], '3': []}}
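
The loop above builds a nested dictionary, whereas the question asks for a flat list of dictionaries with a 'text' key. A minimal sketch of that variant follows, assuming the same markup as in the question. The helper name `parse_days` and the choice to join a day's paragraphs with a single space are illustrative assumptions, not something the question specifies.

```python
import bs4

html = """<div id="january">
<h2><span>January</span></h2>
<p class="subtitle centerline">1</p>
<p>sample text for January 1st</p>
<p>even more sample text</p>
<p class="subtitle centerline">2</p>
<p>sample text for January 2nd</p>
<p>different sample text</p>
<p class="right"><em>John Smith</em></p>
<p class="subtitle centerline">3</p>
</div>"""


def parse_days(soup, months):
    """Hypothetical helper: one dict per day, with the text that follows it."""
    data = []
    for month in months:
        month_block = soup.find(id=month)
        if month_block is None:  # skip months that are not on the page
            continue
        month_name = month_block.find("h2").get_text(strip=True)
        for header in month_block.find_all("p", class_="subtitle"):
            paragraphs = []
            # Walk the following <p> siblings until the next day header.
            for sibling in header.find_next_siblings("p"):
                if "subtitle" in sibling.get("class", []):
                    break
                paragraphs.append(sibling.get_text(strip=True))
            data.append({
                "month": month_name,
                "day": header.get_text(strip=True),
                # Assumption: paragraphs of one day are joined with a space.
                "text": " ".join(paragraphs),
            })
    return data


soup = bs4.BeautifulSoup(html, "html.parser")
data = parse_days(soup, ["january"])
print(data)
```

Note that the author line (class "right") is included in the text for day 2, just as in the output above; if that is not wanted, skipping siblings whose class list contains "right" would be one way to exclude it.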

I recommend another solution, which is well suited to extracting data from XML.

from simplified_scrapy.spider import SimplifiedDoc 
html='''
<div id="january">
<h2><span>January</span></h2>
<p class="subtitle centerline">1</p>
<p>sample text for January 1st</p>
<p>even more sample text</p>
<p class="subtitle centerline">2</p>
<p>sample text for January 2nd</p>
<p>different sample text</p>
<p class="right"><em>John Smith</em></p>
<p class="subtitle centerline">3</p>
...
</div>
'''
data = []
months = ['january', 'february']
doc = SimplifiedDoc(html) # create doc
for month in months:
    month_block = doc.select('#'+month)
    if not month_block: continue
    month_name = month_block.h2.text
    for i in month_block.selects(".subtitle"):
        day_dict = {"month":month_name,"day":i.text,"text":i.next.text}
        data.append(day_dict)
print(data)

Result:

[{'month': 'January', 'day': '1', 'text': 'sample text for January 1st'}, {'month': 'January', 'day': '2', 'text': 'sample text for January 2nd'}, {'month': 'January', 'day': '3', 'text': None}]

More examples are available here: https://github.com/yiyedata/simplified-scrapy-demo/blob/master/doc_examples/
