BeautifulSoup: filtering anchor tags by their text



I have the following HTML content:

<a href="http://app_url1" >install app xyz</a>
<a href="http://app_url2" >install app xyz</a>
<a href="http://app_url3" >install app aaa</a>
<a href="http://app_url4">install app aaa</a>

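I want to keep only the anchor tags whose text ends with a given regular-expression pattern (xyz in this example). I would like to pass the regex pattern to findAll rather than iterate over all the anchor tags a second time.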

You can use the text parameter of BeautifulSoup's find_all method; it accepts a compiled regular expression.

from bs4 import BeautifulSoup
import re
html = """<a href="http://app_url1" >install app xyz</a>
<a href="http://app_url2" >install app xyz</a>
<a href="http://app_url3" >install app aaa</a>
<a href="http://app_url4">install app aaa</a>"""
soup = BeautifulSoup(html, "html.parser")
print(soup.findAll("a", text=re.compile("xyz$")))

Output:

[<a href="http://app_url1">install app xyz</a>, <a href="http://app_url2">install app xyz</a>]
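In Beautiful Soup 4.4.0 and later the same argument is also available under the name string; text is the older spelling. The following is a minimal sketch, assuming bs4 is installed, that wraps the lookup in a helper and returns the href and text of each match. The helper name anchors_ending_with and the constant HTML are hypothetical names chosen for illustration and are not part of bs4.

```python
import re

from bs4 import BeautifulSoup

HTML = """<a href="http://app_url1" >install app xyz</a>
<a href="http://app_url2" >install app xyz</a>
<a href="http://app_url3" >install app aaa</a>
<a href="http://app_url4">install app aaa</a>"""


def anchors_ending_with(html, suffix_pattern):
    """Return (href, text) pairs for <a> tags whose text ends with suffix_pattern.

    Hypothetical helper for illustration; not a bs4 function.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Allow trailing whitespace before the end of the string.
    pattern = re.compile(suffix_pattern + r"\s*$")
    # string= is matched against each tag's .string using pattern.search().
    matches = soup.find_all("a", string=pattern)
    return [(a["href"], a.get_text(strip=True)) for a in matches]


print(anchors_ending_with(HTML, "xyz"))
# [('http://app_url1', 'install app xyz'), ('http://app_url2', 'install app xyz')]
```

Returning plain tuples rather than Tag objects is only a design choice for this sketch; keep the Tag objects if you need other attributes later.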

Use a lambda with str.endswith.

For example:

from bs4 import BeautifulSoup
html = """<div><a href="http://app_url1" >install app xyz</a>
<a href="http://app_url2" >install app xyz</a>
<a href="http://app_url3" >install app aaa</a>
<a href="http://app_url4">install app aaa</a></div>"""
soup = BeautifulSoup(html, "html.parser")
print(soup.find_all("a", text=lambda x: x is not None and x.endswith("xyz")))
# --> [<a href="http://app_url1">install app xyz</a>, <a href="http://app_url2">install app xyz</a>]
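The None check in the lambda matters because the filter is applied to each tag's .string, which is None when an anchor contains nested markup. A sketch of one way to handle that case, assuming bs4 is installed, is to pass find_all a function that receives the tag itself and inspects get_text(). The sample HTML with a nested <b> element and the function name is_xyz_anchor are made up for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the first anchor wraps part of its text in <b>.
html = """<a href="http://app_url1">install <b>app</b> xyz</a>
<a href="http://app_url2">install app xyz</a>
<a href="http://app_url3">install app aaa</a>"""

soup = BeautifulSoup(html, "html.parser")


def is_xyz_anchor(tag):
    # find_all calls this once per tag; get_text() joins the text of all descendants.
    return tag.name == "a" and tag.get_text().strip().endswith("xyz")


# Filtering on the string misses the anchor with nested markup.
by_string = soup.find_all("a", string=lambda s: s is not None and s.endswith("xyz"))
# Filtering on the tag sees the full text of both xyz anchors.
by_tag = soup.find_all(is_xyz_anchor)

print([a["href"] for a in by_string])
print([a["href"] for a in by_tag])
```

This still runs inside a single find_all call, so no separate loop over the anchors is needed.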

I think you can try this to get the text of the anchor tags:

>>> from bs4 import BeautifulSoup
>>> html = """<a href="http://app_url1" >install app xyz</a>
... <a href="http://app_url2" >install app xyz</a>
... <a href="http://app_url3" >install app aaa</a>
... <a href="http://app_url4">install app aaa</a>"""
>>> soup = BeautifulSoup(html, "html.parser")
>>> # Collect the text of each anchor separately
>>> anchor_texts = [a.get_text() for a in soup.find_all("a")]
>>> for text in anchor_texts:
...     print(text)

Output:

install app xyz
install app xyz
install app aaa
install app aaa
