Beautiful Soup: distinguishing text inside tags from text without tags


html1=
...
<span class="ruby"><span class="rb">textrb1 </span><span class="rt">textrt1 </span></span>text1 <span class="ruby"><span class="rb">textrb2 </span><span class="rt">textrt2 </span></span>text2
...

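In the end I want to print something like this: textrb1 (textrt1) text1 textrb2 (textrt2) text2, with the rt text in parentheses. If I print html1.text, I get all the text without parentheses: textrb1 textrt1 text1 textrb2 textrt2 text2. I can reach textrt1 with html1.find('span', class_='rt'). What I would like to know is how to access the "normal" text, text1 and text2, in the correct order, along these lines: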

for text in volltext:
    if text is textrt:
        texts.append('('+text+')')
    else:
        texts.append(text)

You can use .find_parent(class_="rt") on each NavigableString:

from bs4 import BeautifulSoup

html_doc = """
<span class="ruby"><span class="rb">textrb1 </span><span class="rt">textrt1 </span></span>text1 <span class="ruby"><span class="rb">textrb2 </span><span class="rt">textrt2 </span></span>text2
"""

soup = BeautifulSoup(html_doc, "html.parser")

out = []
# find_all(text=True) returns every text node in document order
for text in soup.find_all(text=True):
    # skip whitespace-only nodes such as the leading newline
    if text.strip() == "":
        continue
    # a text node with an ancestor of class "rt" goes in parentheses
    if text.find_parent(class_="rt"):
        out.append("({})".format(text.strip()))
    else:
        out.append(text.strip())

print(" ".join(out))

Prints:

textrb1 (textrt1) text1 textrb2 (textrt2) text2
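
If this is needed in more than one place, the same idea can be wrapped in a small helper. The following is only a sketch: the function name annotate_ruby and its marked_class parameter are made up for illustration and are not part of Beautiful Soup. It also assumes Beautiful Soup 4.4.0 or later, where string=True is the newer spelling of the text=True argument used above.

```python
from bs4 import BeautifulSoup


def annotate_ruby(html, marked_class="rt"):
    """Join all text nodes in document order, wrapping in parentheses
    those that have an ancestor with the given class.

    annotate_ruby and marked_class are hypothetical names for this sketch.
    """
    soup = BeautifulSoup(html, "html.parser")
    parts = []
    for node in soup.find_all(string=True):
        stripped = node.strip()
        if not stripped:
            # whitespace-only node, nothing to output
            continue
        # find_parent walks up all ancestors, so text nested deeper
        # inside the marked element is still detected
        if node.find_parent(class_=marked_class):
            parts.append("({})".format(stripped))
        else:
            parts.append(stripped)
    return " ".join(parts)


html_doc = (
    '<span class="ruby"><span class="rb">textrb1 </span>'
    '<span class="rt">textrt1 </span></span>text1 '
    '<span class="ruby"><span class="rb">textrb2 </span>'
    '<span class="rt">textrt2 </span></span>text2'
)
result = annotate_ruby(html_doc)
print(result)
```

Because the loop visits text nodes rather than tags, the untagged text1 and text2 come out in their original positions between the ruby spans.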
