I have the following HTML content:
<a href="http://app_url1" >install app xyz</a>
<a href="http://app_url2" >install app xyz</a>
<a href="http://app_url3" >install app aaa</a>
<a href="http://app_url4">install app aaa</a>
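I want to filter anchor tags whose text ends with a given regular expression pattern (such as xyz here). I would like to pass the regex pattern to findAll instead of doing an extra iteration over all the anchor tags.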
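You can use BeautifulSoup's text parameter in the find_all method. (Since Beautiful Soup 4.4.0 this argument is named string; text is the older name.)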
from bs4 import BeautifulSoup
import re
html = """<a href="http://app_url1" >install app xyz</a>
<a href="http://app_url2" >install app xyz</a>
<a href="http://app_url3" >install app aaa</a>
<a href="http://app_url4">install app aaa</a>"""
soup = BeautifulSoup(html, "html.parser")
print(soup.findAll("a", text=re.compile("xyz$")))
Output:
[<a href="http://app_url1">install app xyz</a>, <a href="http://app_url2">install app xyz</a>]
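As far as I know, Beautiful Soup applies a compiled pattern with its search() method, so the trailing $ is what limits the match to the end of the text. Here is a minimal stdlib-only sketch of that matching behavior. The first two sample strings come from the question; the last two, and the names `matches`, `tolerant` and `tolerant_matches`, are made up for illustration.

```python
import re

# Pattern from the answer above: "xyz" anchored to the end of the string.
pattern = re.compile(r"xyz$")

# Two texts from the question plus two hypothetical ones
# (one with "xyz" at the start, one with trailing whitespace).
texts = ["install app xyz", "install app aaa", "xyz installer", "install app xyz "]

# search() scans the whole string; the "$" anchor restricts matches to the end.
matches = [t for t in texts if pattern.search(t)]
print(matches)  # ['install app xyz']

# Assumption: if the markup may contain trailing whitespace, allow it explicitly.
tolerant = re.compile(r"xyz\s*$")
tolerant_matches = [t for t in texts if tolerant.search(t)]
print(tolerant_matches)  # ['install app xyz', 'install app xyz ']
```

If the anchor text in your real pages has trailing spaces or newlines, the stricter pattern will silently skip those links, so the tolerant variant may be the safer choice.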
Use a lambda together with str.endswith.
For example:
from bs4 import BeautifulSoup
html = """<div><a href="http://app_url1" >install app xyz</a>
<a href="http://app_url2" >install app xyz</a>
<a href="http://app_url3" >install app aaa</a>
<a href="http://app_url4">install app aaa</a></div>"""
soup = BeautifulSoup(html, "html.parser")
print(soup.find_all("a", text=lambda x: x is not None and x.endswith("xyz")))
# --> [<a href="http://app_url1">install app xyz</a>, <a href="http://app_url2">install app xyz</a>]
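One caveat with the text filter, as far as I understand it: it is checked against the tag's .string, which is None when the anchor contains nested tags. A possible workaround is to pass a function that receives the tag itself and tests its full text. This is only a sketch; the nested <b> in the second link and the name `is_xyz_link` are assumptions added for illustration and are not part of the question.

```python
from bs4 import BeautifulSoup

# Sample markup adapted from the question; the <b> element is a hypothetical
# variation to show an anchor whose text is split across child nodes.
html = """<div><a href="http://app_url1">install app xyz</a>
<a href="http://app_url2"><b>install</b> app xyz</a>
<a href="http://app_url3">install app aaa</a></div>"""
soup = BeautifulSoup(html, "html.parser")


def is_xyz_link(tag):
    """Return True for <a> tags whose combined text ends with 'xyz'."""
    return tag.name == "a" and tag.get_text().strip().endswith("xyz")


# find_all accepts a function that is called once per tag.
links = soup.find_all(is_xyz_link)
hrefs = [a["href"] for a in links]
print(hrefs)  # ['http://app_url1', 'http://app_url2']
```

This still avoids a separate loop over every anchor in your own code, since the filtering happens inside find_all.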
I think you can try this to get the anchor tag text:
>>> from bs4 import BeautifulSoup
>>> html = """<a href="http://app_url1" >install app xyz</a>
... <a href="http://app_url2" >install app xyz</a>
... <a href="http://app_url3" >install app aaa</a>
... <a href="http://app_url4">install app aaa</a>"""
>>> soup = BeautifulSoup(html, "html.parser")
>>> # Collect the text of each anchor tag as a separate list entry
>>> anchor_texts = [a.get_text() for a in soup.find_all("a")]
>>> for i in anchor_texts:
...     print(i)
Output:
install app xyz
install app xyz
install app aaa
install app aaa