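Suppose I have a list of keywords, e.g. ["apple", "dog", "cat"],
and some HTML. I want to return every element that has one of the keywords in its own direct text, not only in a nested child. How can I do this?
I tried soup.find_all(text=keywords),
but it returned nothing.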
from bs4 import BeautifulSoup
source = """
<html>
<p>I like apples</p>
<p>I don't want to match this</p>
<div>Dogs are cool. I don't match this either.</div>
<div>I have a cat.</div>
</html>
"""
soup = BeautifulSoup(source, "html.parser")
keywords = ["apple", "dog", "cat"]
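BeautifulSoup supports regular expressions for text searches, so you can join the keywords into one pattern and compile it with the re.IGNORECASE flag (your keyword is dog, but your element contains Dogs):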
import re
from bs4 import BeautifulSoup
source = """
<html>
<p>I like apples</p>
<p>I don't want to match this</p>
<div>Dogs are cool. I don't match this either.</div>
<div>I have a cat.</div>
</html>
"""
soup = BeautifulSoup(source, "html.parser")
keywords = ["apple", "dog", "cat"]
print(soup.find_all(text=re.compile("|".join(keywords), flags=re.IGNORECASE)))
>>>> ['I like apples', "Dogs are cool. I don't match this either.", 'I have a cat.']
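As for why the original soup.find_all(text=keywords) returned nothing: when the filter is a list, Beautiful Soup keeps only strings that are exactly equal to one of the items, while a compiled regex is applied as a search and so matches substrings. The following is a minimal sketch of that difference; the two-paragraph HTML snippet is made up for illustration, and it uses string=, the newer name for the text= argument (assuming bs4 4.4.0 or later).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet: one paragraph merely contains a keyword,
# the other is exactly equal to a keyword.
soup = BeautifulSoup("<p>I like apples</p><p>apple</p>", "html.parser")
keywords = ["apple", "dog", "cat"]

# A list filter matches only strings equal to one of its items.
exact = [str(s) for s in soup.find_all(string=keywords)]

# A compiled regex is searched within each string, so substrings match.
pattern = re.compile("|".join(keywords), flags=re.IGNORECASE)
partial = [str(s) for s in soup.find_all(string=pattern)]

print(exact)
print(partial)
```

If the keywords could contain regex metacharacters, passing each one through re.escape before joining would be a sensible precaution.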
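As a note, you said "direct descendant", and the element containing Dogs also contains "I don't match this either." Because of how your HTML is structured, both sentences sit in the same text node, so the whole string is returned. If the line were instead something like <div>Dogs are cool. <div>I don't match this either.</div></div>
, then the output would be
['I like apples', 'Dogs are cool. ', 'I have a cat.']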
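The question asks for the elements rather than the strings. One possible way to get them, sketched here under the assumption that the nested-div variant from the note is the input, is to take the .parent of each matching string. The names pattern, matching_strings, elements and tag_names are chosen for illustration; string= is used as the newer name for text=, and re.escape is an added precaution rather than part of the original answer.

```python
import re
from bs4 import BeautifulSoup

# Same document as above, but with the nested <div> from the note.
source = """
<html>
<p>I like apples</p>
<p>I don't want to match this</p>
<div>Dogs are cool. <div>I don't match this either.</div></div>
<div>I have a cat.</div>
</html>
"""

soup = BeautifulSoup(source, "html.parser")
keywords = ["apple", "dog", "cat"]

# Escape each keyword so it is treated literally inside the pattern.
pattern = re.compile("|".join(re.escape(k) for k in keywords), flags=re.IGNORECASE)

# Each match is a text node; its .parent is the tag that directly holds it.
matching_strings = soup.find_all(string=pattern)
elements = [s.parent for s in matching_strings]
tag_names = [el.name for el in elements]

print(tag_names)
```

Because only text nodes are tested, the inner div is not returned, and the outer div is returned once, for the text it holds directly.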