Given this HTML structure:
<strong><a href="https://www.fertilizer.com/2021/07/bvfcl.html" target="_blank">Fertilizer Corporation Limited</a> (BVFCL)</strong> has released an employment notification for the recruitment of <strong>11 DGM, Company Secretary, Finance Manager and Accounts Officer Vacancy</strong>
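If the HTML structure contains fertilizer.com, I need to remove the entire element/tag.
So the final result should be:
null
I know that bs4 has a decompose() method for removing elements, but how do I remove the parent element, and how do I navigate to it?
Please guide me. Thanks.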
Given only the HTML snippet provided, this would be my solution:
from bs4 import BeautifulSoup
txt = '''
<strong>
<a href="https://www.fertilizer.com/2021/07/bvfcl.html" target="_blank">Fertilizer Corporation Limited</a> (BVFCL)
</strong>
has released an employment notification for the recruitment of
<strong>11 DGM, Company Secretary, Finance Manager and Accounts Officer Vacancy
</strong>
'''
soup = BeautifulSoup(txt, 'html.parser')
print(f'Content Before decomposition:\n{soup}')
target = "www.fertilizer.com"
hrefs = [link['href'] for link in soup.find_all('a', href=True) if target in link['href']]
print(hrefs) # ['https://www.fertilizer.com/2021/07/bvfcl.html']
if hrefs:  # Means we found it
    soup.decompose()
print(f'Content After decomposition:\n{soup}')
# <None></None>
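The question also asks how to navigate from the link to its parent. A minimal sketch of that, assuming the element to drop is the nearest enclosing `<strong>` (the helper name `remove_parents_of_matching_links` and its `parent_name` parameter are hypothetical, chosen for illustration): `find_parent()` walks up from the matching `<a>`, and `decompose()` is then called on that ancestor. Note that this removes only the enclosing tag, not the whole snippet as in the expected `null` result.

```python
from bs4 import BeautifulSoup

html = (
    '<strong><a href="https://www.fertilizer.com/2021/07/bvfcl.html" '
    'target="_blank">Fertilizer Corporation Limited</a> (BVFCL)</strong> '
    'has released an employment notification for the recruitment of '
    '<strong>11 DGM, Company Secretary, Finance Manager and Accounts Officer Vacancy</strong>'
)


def remove_parents_of_matching_links(markup, target, parent_name="strong"):
    """Hypothetical helper: drop the nearest `parent_name` ancestor of every
    <a> whose href contains `target`."""
    soup = BeautifulSoup(markup, "html.parser")
    while True:
        # Search the live tree again on each pass, so tags that were already
        # decomposed are never touched a second time.
        link = soup.find("a", href=lambda h: h is not None and target in h)
        if link is None:
            break
        parent = link.find_parent(parent_name)
        # Fall back to removing the link itself when no such ancestor exists.
        (parent if parent is not None else link).decompose()
    return soup


result = remove_parents_of_matching_links(html, "www.fertilizer.com")
print(result)
```

If only the immediate parent is wanted, whatever its tag name, `link.parent` can be used in place of `link.find_parent(parent_name)`.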
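Another solution, if you simply want to end up with nothing at all, is shown below; note that the second loop removes the free text that is not contained in any particular tag.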
from bs4 import BeautifulSoup
txt = '''
<strong>
<a href="https://www.fertilizer.com/2021/07/bvfcl.html" target="_blank">Fertilizer Corporation Limited</a> (BVFCL)
</strong>
has released an employment notification for the recruitment of
<strong>11 DGM, Company Secretary, Finance Manager and Accounts Officer Vacancy
</strong>
'''
soup = BeautifulSoup(txt, 'html.parser')
print(f'Content Before decomposition:\n{soup}')
target = "www.fertilizer.com"
hrefs = [link['href'] for link in soup.find_all('a', href=True) if target in link['href']]
print(hrefs) # ['https://www.fertilizer.com/2021/07/bvfcl.html']
if hrefs:  # Means we found it
    # Handles tags
    for el in soup.find_all():
        el.replace_with("")
    # Handles free text like: 'has released an employment notification for the recruitment of ' (because it is not in a particular tag)
    for el in soup.find_all(string=True):
        el.replace_with("")
print(f'Content After decomposition:\n{soup}')
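As a possible shortcut for the two loops above, `Tag.clear()` removes all children of a tag (tags and free text alike), and it can be called on the soup object itself. A rough sketch, assuming an empty document is an acceptable stand-in for `null`; the helper name `empty_if_target_linked` and the shortened sample markup are illustrative only:

```python
from bs4 import BeautifulSoup

# Shortened version of the snippet from the question, for illustration.
snippet = (
    '<strong><a href="https://www.fertilizer.com/2021/07/bvfcl.html">'
    'Fertilizer Corporation Limited</a> (BVFCL)</strong> has released '
    '<strong>11 Vacancy</strong>'
)


def empty_if_target_linked(markup, target):
    """Hypothetical helper: return an emptied soup if any <a href> contains
    `target`, otherwise return the parsed soup unchanged."""
    soup = BeautifulSoup(markup, "html.parser")
    found = any(target in a["href"] for a in soup.find_all("a", href=True))
    if found:
        soup.clear()  # detaches every child: tags and bare strings
    return soup


emptied = empty_if_target_linked(snippet, "www.fertilizer.com")
kept = empty_if_target_linked(snippet, "example.org")
print(repr(str(emptied)))
print(kept)
```

Unlike calling `decompose()` on the soup, this leaves the `BeautifulSoup` object itself usable afterwards.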