BeautifulSoup/Python - Extract a link URL from a div, depending on the preceding content

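I'm trying to extract a link with BeautifulSoup4 in Python 3.4. The link has no identifying attributes such as id or class. However, each link is preceded by a static text string, like this: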

<h2>
 "Precluding-Text:"
  <a href="http://the-link-im-after.com">Varying Anchor Text</a>
</h2>

My end goal is to get the following output:

http://the-link-im-after.com/

You can use that static text to locate the link:

soup.find(text="Precluding-Text:").find_next_sibling("a")["href"]

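Or, if the text node carries extra whitespace or other characters around the static text, an exact match will not find it, and you may need a partial text match: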

soup.find(text=lambda text: text and "Precluding-Text:" in text).find_next_sibling("a")["href"]
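
As a minimal sketch of the partial match in context, the snippet below runs it against the HTML from the question and returns None instead of raising when either the text or the link is missing. The helper name link_after is made up for illustration, and "html.parser" is just one parser choice. Recent Beautiful Soup releases call the text argument string; text is kept here to match the one-liners above.

```python
from bs4 import BeautifulSoup

html = """
<h2>
 "Precluding-Text:"
  <a href="http://the-link-im-after.com">Varying Anchor Text</a>
</h2>
"""

soup = BeautifulSoup(html, "html.parser")


def link_after(soup, marker):
    """Return the href of the first <a> sibling that follows a text node
    containing marker, or None if nothing matches.

    link_after is a hypothetical helper, not part of Beautiful Soup.
    """
    # Partial match: any text node that contains the marker
    node = soup.find(text=lambda text: text and marker in text)
    if node is None:
        return None
    # First later sibling of the text node that is an <a> tag
    a_tag = node.find_next_sibling("a")
    if a_tag is None:
        return None
    return a_tag.get("href")


print(link_after(soup, "Precluding-Text:"))  # http://the-link-im-after.com
```

Guarding against None matters on real pages, where chaining .find(...).find_next_sibling(...)["href"] fails with an exception as soon as one page lacks the static text.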

Another solution, using Python generators:

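import re

from bs4 import BeautifulSoup, Tag

html = """
<h2>
 "Precluding-Text:"
  <a href="http://the-link-im-after.com">Varying Anchor Text</a>
</h2>
"""

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, "html.parser")

# Find every text node that contains the static text
elements = soup.find_all(text=re.compile("Precluding-Text:"))
if not elements:
    print("not found")
else:
    for elem in elements:
        # next_siblings is a generator; take the node right after the text
        a_tag = next(elem.next_siblings, None)
        # Plain text nodes have no attributes, so only look at real tags
        if isinstance(a_tag, Tag) and a_tag.get("href") is not None:
            print(a_tag.get("href"))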
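
If the page has several such headings, the same idea can be wrapped in a generator function of your own. This is only a sketch: iter_links_after is a made-up name, the example.com URLs are placeholder data, and it uses find_next_sibling("a") instead of next() so that it still works when other nodes sit between the text and the link.

```python
import re

from bs4 import BeautifulSoup


def iter_links_after(html, marker):
    """Yield the href of the first <a> sibling after each text node
    that contains marker. Hypothetical helper for illustration."""
    soup = BeautifulSoup(html, "html.parser")
    # re.escape keeps characters in the marker from being read as regex syntax
    for node in soup.find_all(text=re.compile(re.escape(marker))):
        a_tag = node.find_next_sibling("a")
        if a_tag is not None and a_tag.get("href"):
            yield a_tag["href"]


# Placeholder markup with two matching headings and one that should be skipped
sample = """
<h2>
 Precluding-Text:
  <a href="http://example.com/first">First</a>
</h2>
<h2>
 Other-Text:
  <a href="http://example.com/skipped">Skipped</a>
</h2>
<h2>
 Precluding-Text:
  <a href="http://example.com/second">Second</a>
</h2>
"""

links = list(iter_links_after(sample, "Precluding-Text:"))
print(links)
```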
