Python Beautiful Soup: search for elements containing text from a list

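Suppose I have a list of keywords, i.e. ["apple", "dog", "cat"], and some HTML. I want to return every element that has text containing one of the keywords as a direct descendant. How can I do this?

I tried soup.find_all(text=keywords), but it returned no results.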

from bs4 import BeautifulSoup
source = """
<html>
<p>I like apples</p>
<p>I don't want to match this</p>
<div>Dogs are cool. I don't match this either.</div>
<div>I have a cat.</div>
</html>
"""
soup = BeautifulSoup(source, "html.parser")
keywords = ["apple", "dog", "cat"]

BeautifulSoup supports regular expressions for text searches, so you can use one together with the IGNORECASE flag (since your keyword is dog and your element contains Dogs):

import re
from bs4 import BeautifulSoup
source = """
<html>
<p>I like apples</p>
<p>I don't want to match this</p>
<div>Dogs are cool. I don't match this either.</div>
<div>I have a cat.</div>
</html>
"""
soup = BeautifulSoup(source, "html.parser")
keywords = ["apple", "dog", "cat"]
print(soup.find_all(text=re.compile("|".join(keywords), flags=re.IGNORECASE)))

>>>> ['I like apples', "Dogs are cool. I don't match this either.", 'I have a cat.']
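The call above returns the matching text nodes, while the question asks for the elements. A minimal sketch of one way to get from one to the other, assuming a Beautiful Soup version that accepts the string= argument (the newer name for text=): take the .parent of each matching text node. The names pattern and elements are illustrative, and re.escape is added on the assumption that keywords might contain regex metacharacters. The sketch also shows why the original attempt came back empty: a list filter matches only text nodes exactly equal to one of its items.

```python
import re
from bs4 import BeautifulSoup

source = """
<html>
<p>I like apples</p>
<p>I don't want to match this</p>
<div>Dogs are cool. I don't match this either.</div>
<div>I have a cat.</div>
</html>
"""
soup = BeautifulSoup(source, "html.parser")
keywords = ["apple", "dog", "cat"]

# A list filter requires a text node to equal one of the items exactly,
# so substrings such as "apple" inside "I like apples" do not match.
exact_matches = soup.find_all(string=keywords)

# Escape each keyword so it is treated literally, then match case-insensitively.
pattern = re.compile("|".join(re.escape(k) for k in keywords), flags=re.IGNORECASE)

# Each matching text node's .parent is the tag that holds it as a direct child.
elements = [text_node.parent for text_node in soup.find_all(string=pattern)]

print(exact_matches)
print(elements)
```

If one element holds several matching text nodes, it will appear more than once in elements, so deduplicate if that matters for your use.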

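As a note, you said "direct descendants", and the element containing "Dogs" also contains "I don't match this either." Because of how your HTML is formatted, that text is part of the match. If the line were instead something like <div>Dogs are cool. <div>I don't match this either.</div></div>, then the output would be:

['I like apples', 'Dogs are cool. ', 'I have a cat.']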
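The nested case can be checked in isolation. This is a small sketch rather than part of the original answer; the variable names nested, pattern and matches are illustrative, and string= is assumed to be available in your Beautiful Soup version.

```python
import re
from bs4 import BeautifulSoup

# The inner <div> splits the text into two separate text nodes.
nested = "<div>Dogs are cool. <div>I don't match this either.</div></div>"
soup = BeautifulSoup(nested, "html.parser")
pattern = re.compile("apple|dog|cat", flags=re.IGNORECASE)

# Only the text node directly under the outer <div> contains a keyword.
matches = soup.find_all(string=pattern)
print(matches)

# The parent of the matching node is the outer <div>, not the inner one.
parents = [m.parent for m in matches]
print(parents)
```

Because the search works on individual text nodes, the inner div's text is only returned if it contains a keyword itself.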
