How to exclude nested tags from a parent tag and get only the text, skipping the link (a) tags



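I want to exclude the nested tags contained in the paragraphs, for example by ignoring the a tag that wraps the linked words in this case.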

import requests
from bs4 import BeautifulSoup

base_url = "https://www.usatoday.com/story/tech/2020/09/17/qanon-conspiracy-theories-debunked-social-media/5791711002/"
response = requests.get(base_url)
html = response.content
# The parser name is the second positional argument (keyword: features), not parser=
bs = BeautifulSoup(html, "lxml")
article = bs.find_all("article", {"class": "gnt_pr"})
body = article[0].find_all("p", {"class": "gnt_ar_b_p"})

The output is:

[<p class="gnt_ar_b_p">An emboldened community of believers known as QAnon is spreading a baseless patchwork of conspiracy theories that are fooling Americans who are looking for simple answers in a time of intense political polarization, social isolation and economic turmoil.</p>,
<p class="gnt_ar_b_p">Experts call QAnon <a class="gnt_ar_b_a" data-t-l="|inline|intext|n/a" href="https://www.usatoday.com/in-depth/tech/2020/08/31/qanon-conspiracy-theories-trump-election-covid-19-pandemic-extremist-groups/5662374002/" rel="noopener" target="_blank">a "digital cult"</a> because of its pseudo-religious qualities and an extreme belief system that enthrones President Donald Trump as a savior figure crusading against evil.</p>,
<p class="gnt_ar_b_p">The core of QAnon is the false theory that Trump was elected to root out a secret child-sex trafficking ring run by Satanic, cannibalistic Democratic politicians and celebrities. Although it may sound absurd, it has nonetheless attracted devoted followers who have begun to perpetuate other theories that they suggest, imply or argue are somehow related to the main premise.</p>,

I want to exclude these tags.

To get only the text from the paragraphs, you can use the .get_text() method. For example:

import requests
from bs4 import BeautifulSoup

base_url = "https://www.usatoday.com/story/tech/2020/09/17/qanon-conspiracy-theories-debunked-social-media/5791711002/"
response = requests.get(base_url)
soup = BeautifulSoup(response.content, "lxml")

# Select every <p> inside the <article> element
body = soup.select("article p")
for paragraph in body:
    # get_text() returns the text of the paragraph and of all nested tags
    print(paragraph.get_text(strip=True, separator=' '))

Prints:

An emboldened community of believers known as QAnon is spreading a baseless patchwork of conspiracy theories that are fooling Americans who are looking for simple answers in a time of intense political polarization, social isolation and economic turmoil.

...etc.
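The behaviour of .get_text() on a paragraph that contains a link can be checked offline with a small sketch. The markup below is a made-up sample shaped like the paragraphs in the question (the example.com URLs and the helper name paragraph_texts are assumptions, not part of the original page), and the built-in "html.parser" is used instead of "lxml" only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the paragraphs in the question.
SAMPLE_HTML = (
    '<article>'
    '<p class="gnt_ar_b_p">Experts call QAnon '
    '<a class="gnt_ar_b_a" href="https://example.com/">a "digital cult"</a>'
    ' because of its belief system.</p>'
    '<p class="gnt_ar_b_p">Read <a href="https://example.com/more">more</a>.</p>'
    '</article>'
)


def paragraph_texts(html, **get_text_kwargs):
    """Return the text of every <p>, with nested tags reduced to their text."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text(**get_text_kwargs) for p in soup.find_all("p")]


# Default call: text pieces are concatenated exactly as they appear.
plain = paragraph_texts(SAMPLE_HTML)
print(plain[0])  # Experts call QAnon a "digital cult" because of its belief system.
print(plain[1])  # Read more.

# strip=True with separator=" " strips each piece, then joins with a space.
spaced = paragraph_texts(SAMPLE_HTML, strip=True, separator=" ")
print(spaced[1])  # Read more .
```

Note the space before the final period in the last line: with strip=True and a separator, a separator is also inserted between a link and punctuation that directly follows it. If that matters, calling .get_text() without arguments keeps the original spacing.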

Alternatively, you can .unwrap() all the elements inside each paragraph and then get the text:

for paragraph in body:
    # Replace every nested tag with its own contents, leaving only text
    for tag in paragraph.find_all():
        tag.unwrap()
    print(paragraph.text)
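As a self-contained illustration of the .unwrap() approach, here is a minimal sketch on an inline snippet. The markup, the example.com URL and the helper name unwrap_all are assumptions made for the example, and "html.parser" stands in for "lxml" to keep it dependency-free.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a paragraph with a link that itself contains a <b> tag.
SAMPLE_HTML = (
    '<p class="gnt_ar_b_p">Experts call QAnon '
    '<a href="https://example.com/">a <b>"digital cult"</b></a>'
    ' because of its belief system.</p>'
)


def unwrap_all(html):
    """Parse html and unwrap every tag nested inside the first <p>."""
    soup = BeautifulSoup(html, "html.parser")
    paragraph = soup.find("p")
    # find_all() returns a list up front, so modifying the tree while
    # looping over it is safe here.
    for tag in paragraph.find_all():
        tag.unwrap()
    return paragraph


paragraph = unwrap_all(SAMPLE_HTML)
print(len(paragraph.find_all()))  # 0
print(paragraph.text)  # Experts call QAnon a "digital cult" because of its belief system.
```

Unlike .get_text(), .unwrap() modifies the parsed tree in place, so the links are gone from the soup afterwards. That is convenient if the cleaned paragraphs are reused later, and a drawback if the original links are still needed.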
