Python Beautiful Soup: search for elements containing text from a list



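Suppose I have a list of keywords, e.g. ["apple", "dog", "cat"], and some HTML. I want to return all elements that contain one of the keywords as a direct descendant. How can I do that?

I tried soup.find_all(text=keywords), but it returned nothing.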

from bs4 import BeautifulSoup
source = """
<html>
<p>I like apples</p>
<p>I don't want to match this</p>
<div>Dogs are cool. I don't match this either.</div>
<div>I have a cat.</div>
</html>
"""
soup = BeautifulSoup(source, "html.parser")
keywords = ["apple", "dog", "cat"]

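BeautifulSoup supports regular expressions for text searches, so you can join the keywords into one pattern and compile it with the IGNORECASE flag (needed because your keyword is dog while your element contains Dogs):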

import re
from bs4 import BeautifulSoup
source = """
<html>
<p>I like apples</p>
<p>I don't want to match this</p>
<div>Dogs are cool. I don't match this either.</div>
<div>I have a cat.</div>
</html>
"""
soup = BeautifulSoup(source, "html.parser")
keywords = ["apple", "dog", "cat"]
print(soup.find_all(text=re.compile("|".join(keywords), flags=re.IGNORECASE)))

>>>> ['I like apples', "Dogs are cool. I don't match this either.", 'I have a cat.']
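Note that the call above returns the matching text nodes, not the tags. If you want the elements themselves, one option is to take each node's .parent. The following is a minimal sketch of that idea, not the only way to do it. It assumes bs4 4.4.0 or later, where string= is the current name for the text= argument, and it adds re.escape as a precaution in case a keyword contains regex metacharacters; the variable names pattern and elements are illustrative.

```python
import re
from bs4 import BeautifulSoup

source = """
<html>
<p>I like apples</p>
<p>I don't want to match this</p>
<div>Dogs are cool. I don't match this either.</div>
<div>I have a cat.</div>
</html>
"""
soup = BeautifulSoup(source, "html.parser")
keywords = ["apple", "dog", "cat"]

# Escape each keyword so characters such as "." or "+" are matched literally.
pattern = re.compile("|".join(re.escape(k) for k in keywords), flags=re.IGNORECASE)

# find_all(string=...) yields text nodes; .parent is the tag that directly holds each one.
elements = [node.parent for node in soup.find_all(string=pattern)]
print([el.name for el in elements])
```

If one tag directly holds several matching text nodes, it will appear once per node, so deduplicate the list if that matters for your use case.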

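As a side note, you said "direct descendant", and the element containing Dogs also holds the text "I don't match this either". Because of how your HTML is structured, that whole string is picked up as a single text node. If the line were instead something like <div>Dogs are cool. <div>I don't match this either.</div></div>, the output would be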

['I like apples', 'Dogs are cool. ', 'I have a cat.']
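To check the nested case yourself, here is a small sketch that uses the modified markup. It assumes bs4 4.4.0 or later for the string= argument; the names nested and matches are illustrative. The inner div gets its own text node, so only the outer div's own text is returned.

```python
import re
from bs4 import BeautifulSoup

# Same document as before, but the second sentence now sits in a nested <div>.
nested = """
<html>
<p>I like apples</p>
<p>I don't want to match this</p>
<div>Dogs are cool. <div>I don't match this either.</div></div>
<div>I have a cat.</div>
</html>
"""
soup = BeautifulSoup(nested, "html.parser")
pattern = re.compile("apple|dog|cat", flags=re.IGNORECASE)

# Each text node is tested on its own, so the inner <div> text is skipped.
matches = soup.find_all(string=pattern)
print(matches)
```

The matched node for the outer div keeps its trailing space, because the text runs right up to the opening tag of the inner div.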
