Based on this question, the following code iterates over the tags that sit between two h2 tags:
from bs4 import BeautifulSoup, Tag

data = """<h2><name>Main Section</name><content>bla bla bla</content></h2>
<p>Bla bla bla</p>
<h3>Subsection</h3>
<p>Some more info</p>
<h3>Subsection 2</h3>
<p>Even more info!</p>
<h2><name>Main Section 2</name><content>blah...</content></h2>
<p>bla</p>
<h3>Subsection</h3>
<p>Some more info</p>
<h3>Subsection 2</h3>
<p>Even more info!</p>"""

soup = BeautifulSoup(data)
for main_section in soup.find_all('h2'):
    for sibling in main_section.next_siblings:
        # Skip text nodes (e.g. the newlines between tags)
        if not isinstance(sibling, Tag):
            continue
        # Stop at the next main section
        if sibling.name == 'h2':
            break
        print(sibling)
This works well: with print(sibling) at the end, the loop walks through all of the data. However, if I use append instead, the inner loop stops after a single iteration:
soup = BeautifulSoup(data)
for main_section in soup.find_all('h2'):
    for sibling in main_section.next_siblings:
        if not isinstance(sibling, Tag):
            continue
        if sibling.name == 'h2':
            break
        main_section.content.append(sibling.extract())  # <-- the line in question
The content tag ends up containing only the first following sibling (the same thing happens even if I remove extract()). The output is:
<h2><name>Main Section</name><content>bla bla bla<p>Bla bla bla</p></content></h2>
<h2><name>Main Section 2</name><content>blah...<p>bla</p></content></h2>
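If I run the code again, the next tag gets moved into the <content>...</content> tag as well.
Basically, I want to put all of the data and subsections inside the content tag of their main section.
The output I want is: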
<h2><name>Main Section</name><content>bla bla bla<p>Bla bla bla</p><h3>Subsection</h3><p>Some more info</p><h3>Subsection 2</h3><p>Even more info!</p></content></h2>
<h2><name>Main Section 2</name><content>blah...<p>bla</p><h3>Subsection</h3><p>Some more info</p><h3>Subsection 2</h3><p>Even more info!</p></content></h2>
- Why does the iteration stop when I use append?
- How can I append all of the tags between two main tags?
Collecting the tags in a new list first solved my problem.
soup = BeautifulSoup(data)
for main_section in soup.find_all('h2'):
    # Collect the siblings first, without modifying the tree
    x = []
    for sibling in main_section.next_siblings:
        if not isinstance(sibling, Tag):
            continue
        if sibling.name == 'h2':
            break
        x.append(sibling)
    # Move them only after the iteration has finished
    for y in x:
        main_section.append(y)
After that, I was able to append all of the siblings to main_section.
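A likely explanation for the first question, offered as an assumption rather than a confirmed diagnosis: next_siblings is a lazy generator that follows each node's next_sibling link. Once a sibling has been moved into the content tag it is the last child there, so it has no next sibling and the loop ends. The exact behaviour may differ between Beautiful Soup versions. The sketch below is one way to get the nesting from the desired output: it takes a snapshot of the siblings with list() before moving anything and appends into the content tag rather than into the h2 itself. The shortened data string and the explicit "html.parser" argument are choices made for this example, not part of the original code.

```python
from bs4 import BeautifulSoup, Tag

# Shortened sample data (an assumption for this sketch, same shape as above)
data = """<h2><name>Main Section</name><content>bla bla bla</content></h2>
<p>Bla bla bla</p>
<h3>Subsection</h3>
<p>Some more info</p>
<h2><name>Main Section 2</name><content>blah...</content></h2>
<p>bla</p>
<h3>Subsection</h3>"""

soup = BeautifulSoup(data, "html.parser")

for main_section in soup.find_all("h2"):
    content = main_section.find("content")
    # list() snapshots the siblings, so moving nodes cannot cut the loop short
    for sibling in list(main_section.next_siblings):
        if not isinstance(sibling, Tag):
            continue  # skip the newline text nodes between tags
        if sibling.name == "h2":
            break  # the next main section starts here
        content.append(sibling.extract())

sections = [str(h2) for h2 in soup.find_all("h2")]
for section in sections:
    print(section)
```

Because the snapshot is taken before any node is moved, the loop sees every original sibling up to the next h2, regardless of how the generator reacts to changes in the tree.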