Suppose I have the following HTML:
<div class="test">
<a class="someclass" href="somesite.com"> LINK </a>
<a class="someclass" href="othersite.com"> IMAGE</a>
</div>
Is there a way to get the href from every a tag that contains the text "LINK"? In this example, that would be somesite.com.
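The problem is that the text you are trying to match is surrounded by whitespace. You can use a regular expression to ignore the whitespace and pass it as the text argument to find.
from bs4 import BeautifulSoup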
import re
html = '''<div class="test">
<a class="someclass" href="somesite.com"> LINK </a>
<a class="someclass" href="othersite.com"> IMAGE</a>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
# \s* allows optional whitespace on either side of the link text
regex = re.compile(r'\s*%s\s*' % 'LINK')
results = soup.find("a", text=regex)
print(results['href'])
Output:
somesite.com
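Since the question asks for every matching a tag, the same regex idea can be extended with find_all. The following is only a sketch under a few assumptions: the helper name hrefs_by_exact_text and the third "UNLINKED" anchor are made up for illustration, the pattern is anchored with ^ and $ so that text merely containing "LINK" is not matched, and string= is used as the newer name for the text= argument (available in Beautiful Soup 4.4.0 and later).

```python
import re
from bs4 import BeautifulSoup

# Sample markup from the question, plus one hypothetical extra anchor
# whose text merely contains "LINK".
html = '''<div class="test">
<a class="someclass" href="somesite.com"> LINK </a>
<a class="someclass" href="othersite.com"> IMAGE</a>
<a class="someclass" href="thirdsite.com">UNLINKED</a>
</div>'''

def hrefs_by_exact_text(soup, label):
    """Return the href of every <a> whose text equals label, ignoring surrounding whitespace."""
    # re.escape guards against regex metacharacters in the label;
    # ^ and $ make the whole string match, not just a substring.
    pattern = re.compile(r'^\s*%s\s*$' % re.escape(label))
    return [a['href'] for a in soup.find_all('a', string=pattern) if a.has_attr('href')]

soup = BeautifulSoup(html, 'html.parser')
print(hrefs_by_exact_text(soup, 'LINK'))
```

Without the anchors, the regex is applied as a search, so the "UNLINKED" anchor would also be returned.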
Another approach is to call find_all, loop over the results, and compare text.strip() with the text you want:
from bs4 import BeautifulSoup
html = '''<div class="test">
<a class="someclass" href="somesite.com"> LINK </a>
<a class="someclass" href="othersite.com"> IMAGE</a>
</div>'''
# Find href by text 'link'
soup = BeautifulSoup(html, 'html.parser')
results = soup.find_all('a')
print([x['href'] for x in results if x.text.strip() == 'LINK'])
Output:
['somesite.com']
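A variation on the loop above is to hand find_all a filter function, so the matching happens during the search itself. This is a sketch rather than part of the original answer: the function name is_link_anchor is hypothetical, and the markup is modified to wrap the text in a b tag to show that get_text(strip=True) still works when the text is nested.

```python
from bs4 import BeautifulSoup

# Variant of the question's markup: the link text is nested inside <b>.
html = '''<div class="test">
<a class="someclass" href="somesite.com"> <b>LINK</b> </a>
<a class="someclass" href="othersite.com"> IMAGE</a>
</div>'''

def is_link_anchor(tag):
    """True for <a> tags with an href whose stripped text is exactly 'LINK'."""
    return (
        tag.name == 'a'
        and tag.has_attr('href')
        and tag.get_text(strip=True) == 'LINK'
    )

soup = BeautifulSoup(html, 'html.parser')
# find_all calls the function on every tag and keeps those returning True.
hrefs = [a['href'] for a in soup.find_all(is_link_anchor)]
print(hrefs)
```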