Given the following HTML, how can I use BeautifulSoup to remove all tags except style tags (such as <strong> or <em>)?
<ol class="journal">
<li>A. Gilad Kusne, Heshan Yu, Changming Wu, Huairuo Zhang, Jason Hattrick-Simpers, Brian
DeCost, Suchismita Sarker, Corey Oses, Cormac Toher, Stefano Curtarolo, Albert V. Davydov,
Ritesh Agarwal, Leonid A. Bendersky, Mo Li, Apurva Mehta, Ichiro Takeuchi. <strong>On-the-fly
closed-loop materials discovery via Bayesian active learning</strong>. <em>Nature Communications</em>, 2020; 11 (1) DOI: <a href="http://dx.doi.org/10.1038/s41467-020-19597-w" rel="nofollow" target="_blank">10.1038/s41467-020-19597-w</a>
</li>
</ol>
I know I can use a regex to remove specific tags, but is there an elegant way in BeautifulSoup to remove some tags while leaving others in place?
Use soup.descendants to select the tags you want to keep:
[node for node in soup.descendants if node.name in ['strong','em']]
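Note that the list above only selects the <strong> and <em> nodes; it does not strip anything from the tree. A minimal sketch of the removal step, assuming BeautifulSoup's `Tag.unwrap()` (which replaces a tag with its children) is acceptable here. The shortened `html` sample and the names `KEEP` and `cleaned` are illustrative, not from the original post:

```python
from bs4 import BeautifulSoup

# Shortened version of the HTML from the question (author list trimmed for brevity).
html = """<ol class="journal">
<li>Ichiro Takeuchi. <strong>On-the-fly closed-loop materials discovery via Bayesian active learning</strong>. <em>Nature Communications</em>, 2020; 11 (1) DOI: <a href="http://dx.doi.org/10.1038/s41467-020-19597-w" rel="nofollow" target="_blank">10.1038/s41467-020-19597-w</a>
</li>
</ol>"""

KEEP = {"strong", "em"}  # hypothetical whitelist of tags to leave in place

soup = BeautifulSoup(html, "html.parser")

# find_all(True) returns a list of every tag, collected up front,
# so the tree can be modified safely while looping over it.
for tag in soup.find_all(True):
    if tag.name not in KEEP:
        tag.unwrap()  # drop the tag itself but keep its text and child nodes

cleaned = str(soup)
print(cleaned)
```

Because the tree is edited directly, no regex is needed, and the attributes of the removed tags (such as href) disappear along with the tags.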
Try this:
import re
from bs4 import BeautifulSoup as bs
html = """<ol class="journal">
<li>A. Gilad Kusne, Heshan Yu, Changming Wu, Huairuo Zhang, Jason
Hattrick-Simpers, Brian DeCost, Suchismita Sarker, Corey Oses, Cormac Toher,
Stefano Curtarolo, Albert V. Davydov, Ritesh Agarwal, Leonid A. Bendersky,
Mo Li, Apurva Mehta, Ichiro Takeuchi. <strong>On-the-fly closed-loop
materials discovery via Bayesian active learning</strong>.
<em>Nature Communications</em>, 2020; 11 (1) DOI:
<a href="http://dx.doi.org/10.1038/s41467-020-19597-w" rel="nofollow"
target="_blank">10.1038/s41467-020-19597-w</a>
</li>
</ol>"""
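# html.parser ships with Python; the input is HTML, so the XML parser (lxml) is not needed
soup = bs(html, features='html.parser')
# Names of every tag that is not a style tag (a set avoids repeated substitutions)
tags = {tag.name for tag in soup.find_all(True) if tag.name not in ['strong', 'em']}
for tag in tags:
    # \b stops a short name such as "a" from also matching <abbr> or <article>
    html = re.sub(rf'</?{tag}\b[^>]*>', '', html)
print(html)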
Output:
A. Gilad Kusne, Heshan Yu, Changming Wu, Huairuo Zhang, Jason Hattrick-Simpers,
Brian DeCost, Suchismita Sarker, Corey Oses, Cormac Toher, Stefano Curtarolo,
Albert V. Davydov, Ritesh Agarwal, Leonid A. Bendersky, Mo Li, Apurva Mehta,
Ichiro Takeuchi. <strong>On-the-fly closed-loop materials discovery
via Bayesian active learning</strong>. <em>Nature Communications</em>,
2020; 11 (1) DOI: 10.1038/s41467-020-19597-w
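One quick way to check a result like this is to re-parse the cleaned string and collect the tag names that are still present. This is only a sketch: the `cleaned` string below is a shortened stand-in for the output above, and the variable names are illustrative:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the cleaned output (author list trimmed).
cleaned = (
    "Ichiro Takeuchi. <strong>On-the-fly closed-loop materials discovery "
    "via Bayesian active learning</strong>. <em>Nature Communications</em>, "
    "2020; 11 (1) DOI: 10.1038/s41467-020-19597-w"
)

# Collect the name of every tag that survived the clean-up.
remaining = {tag.name for tag in BeautifulSoup(cleaned, "html.parser").find_all(True)}
print(remaining)
```

If the set contains anything other than strong and em, some tag slipped through.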