How to remove a parent element in BeautifulSoup



Given this HTML structure:

<strong><a href="https://www.fertilizer.com/2021/07/bvfcl.html" target="_blank">Fertilizer Corporation Limited</a> (BVFCL)</strong> has released an employment notification for the recruitment of <strong>11 DGM, Company Secretary, Finance Manager and Accounts Officer Vacancy</strong> 

If the HTML contains fertilizer.com, I need to remove the whole element/tag.

So the final result should be:

null

I know bs4 has a decompose() method for removing an element, but how do I remove the parent element, and how do I navigate to it?

Please guide me. Thanks.

Given the only HTML snippet provided, this would be my solution:

from bs4 import BeautifulSoup

txt = '''
<strong>
<a href="https://www.fertilizer.com/2021/07/bvfcl.html" target="_blank">Fertilizer Corporation Limited</a> (BVFCL)
</strong> 
has released an employment notification for the recruitment of 
<strong>11 DGM, Company Secretary, Finance Manager and Accounts Officer Vacancy
</strong> 
'''
soup = BeautifulSoup(txt, 'html.parser')
print(f'Content Before decomposition:\n{soup}')
target = "www.fertilizer.com"
hrefs = [link['href'] for link in soup.find_all('a', href=True) if target in link['href']]
print(hrefs) # ['https://www.fertilizer.com/2021/07/bvfcl.html']
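if hrefs:  # A matching link was found
    # Destroys the whole parsed document, not just the link's parent
    soup.decompose()
print(f'Content After decomposition:\n{soup}')
# Output reported in the original answer: <None></None>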
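The question also asks how to navigate to the parent of the matching link. Below is a minimal sketch of that, assuming the goal is to drop only the tag that directly encloses the link (here the `<strong>`) rather than the whole document. The function name `remove_parents_of_links` and the sample markup are hypothetical and used only for illustration; `.parent`, `find()` and `decompose()` are standard bs4 calls.

```python
from bs4 import BeautifulSoup


def remove_parents_of_links(markup, target):
    """Remove the enclosing tag of every <a> whose href contains `target`."""
    soup = BeautifulSoup(markup, "html.parser")

    def href_matches(href):
        # Called with None for tags that have no href attribute
        return href is not None and target in href

    # Search again after each removal so we never touch an already-destroyed tag
    link = soup.find("a", href=href_matches)
    while link is not None:
        parent = link.parent
        if parent is None or parent is soup:
            # The link sits at the top level: remove only the link itself
            link.decompose()
        else:
            # Navigate to the enclosing tag and destroy it with its contents
            parent.decompose()
        link = soup.find("a", href=href_matches)
    return str(soup)


# Hypothetical sample markup, modelled on the snippet in the question
sample = (
    '<p>Intro</p>'
    '<strong><a href="https://www.fertilizer.com/2021/07/bvfcl.html">'
    'Fertilizer Corporation Limited</a> (BVFCL)</strong>'
    ' has released an employment notification'
)
result = remove_parents_of_links(sample, "fertilizer.com")
print(result)
```

Free text that follows the removed tag stays in place, because it belongs to the document and not to the `<strong>` element. If the surrounding block should go as well, one option is to climb further with `link.find_parent(...)` before calling `decompose()`.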

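Another solution, if you simply want to end up with nothing at all, is shown below. Note that the second loop removes free text that is not contained in any particular tag.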

from bs4 import BeautifulSoup

txt = '''
<strong>
<a href="https://www.fertilizer.com/2021/07/bvfcl.html" target="_blank">Fertilizer Corporation Limited</a> (BVFCL)
</strong> 
has released an employment notification for the recruitment of 
<strong>11 DGM, Company Secretary, Finance Manager and Accounts Officer Vacancy
</strong> 
'''
soup = BeautifulSoup(txt, 'html.parser')
print(f'Content Before decomposition:\n{soup}')
target = "www.fertilizer.com"
hrefs = [link['href'] for link in soup.find_all('a', href=True) if target in link['href']]
print(hrefs) # ['https://www.fertilizer.com/2021/07/bvfcl.html']
if hrefs:  # A matching link was found
    # Handles tags
    for el in soup.find_all():
        el.replace_with("")
    # Handles free text like 'has released an employment notification for the recruitment of '
    # (because it is not inside a particular tag)
    for el in soup.find_all(string=True):
        el.replace_with("")
print(f'Content After decomposition:\n{soup}')
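If an empty document is the only goal, the two loops can probably be collapsed into a single call. The sketch below assumes that `Tag.clear()`, which removes all of a tag's children, is acceptable here; the helper name `blank_if_target_linked` and the sample strings are hypothetical.

```python
from bs4 import BeautifulSoup


def blank_if_target_linked(markup, target):
    """Return an empty string if any link points at `target`, else the markup."""
    soup = BeautifulSoup(markup, "html.parser")
    found = any(target in a["href"] for a in soup.find_all("a", href=True))
    if found:
        # clear() drops every child of the document, tags and free text alike
        soup.clear()
    return str(soup)


# Hypothetical sample inputs
with_link = '<strong><a href="https://www.fertilizer.com/x.html">F</a></strong> text'
without_link = '<b>keep</b> text'
emptied = blank_if_target_linked(with_link, "www.fertilizer.com")
kept = blank_if_target_linked(without_link, "www.fertilizer.com")
print(repr(emptied))
print(repr(kept))
```

Unlike `decompose()` on the whole soup, this leaves the `BeautifulSoup` object itself usable afterwards, so new content can still be appended to it.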
