Extract a tag containing three or more search strings from a BeautifulSoup tree



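I am trying to search an HTML document for 3 (or more) specific regular expressions. The HTML files all have different forms and layouts, but they contain specific words, so I can search for those words.

For now, I want to return the following: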

<div>
<p>This 17 is A BIG test</p>
<p>This is another greaterly test</p>
<p>17738 that is yet <em>another</em>  <strong>test</strong> with a CAR</p>
</div>

I have tried many versions of the code, but at the moment I am stumbling around in the dark.

import re
from bs4 import Tag, BeautifulSoup
text = """
<body>
<div>
<div>
<p>This 19 is A BIG test</p>
<p>This is another test</p>
<p>19 that is yet <em>another</em> great <strong>test</strong> with a CAR</p>
</div>
<div>
<p>This 17 is A BIG test</p>
<p>This is another greaterly test</p>
<p>17738 that is yet <em>another</em>  <strong>test</strong> with a CAR</p>
</div>
</div>
</body>
"""

def searchme(bstag):
    print("searchme")
    regex1 = r"17738"
    regex2 = r"CAR"
    regex3 = r"greaterly"
    switch1 = 0
    switch2 = 0
    switch3 = 0
    result1 = bstag.find(string=re.compile(regex1, re.MULTILINE))
    if len(result1) >= 1:
        switch1 = 1
        result2 = result1.parent.find(string=re.compile(regex2, re.MULTILINE))
        if len(result2) >= 1:
            switch2 = 1
            result3 = result2.parent.find_all(string=re.compile(regex3, re.MULTILINE))
            if len(result3) >= 1:
                switch3 = 1
    if switch1 == 1 and switch2 == 1 and switch3 == 1:
        return bstag
    else:
        if bstag.parent is not None:
            searchme(bstag.parent)
        else:
            searchme(result1.parent)
soup = BeautifulSoup(text, 'html.parser')
el = searchme(soup)
print(el)

Edit 1

Updated the desired return code.

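I am not sure I have understood the example, because no single <p> element in the text object contains all 3 regex terms.

However, if I have parsed the question correctly, I would suggest not using regex for this task (it is suboptimal in terms of computation time and overhead) and relying on the simpler in operator instead. Below is an MWE in which I slightly modified the text from the original example so that it includes the line you are interested in.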

from bs4 import Tag, BeautifulSoup
text = """
<body>
<div>
<div>
<p>This 19 is A BIG test</p>
<p>This is another test</p>
<p>19 that is yet <em>another</em> great <strong>test</strong> with a CAR</p>
</div>
<div>
<p>This 17 is A BIG test</p>
<p>This is another greaterly test</p>
<p>17738 that is yet <em>another</em> greaterly <strong>test</strong> with a CAR</p>
</div>
</div>
</body>
"""
t1 = '17738' # terms to be searched
t2 = 'CAR'
t3 = 'greaterly'
soup = BeautifulSoup(text, 'html.parser')
for row in soup.find_all('div'):  # check every <div>; the outer <div> matches too, since it contains the inner one
    if t1 in row.text and t2 in row.text and t3 in row.text:  # the <div> text contains all terms
        print(row.text)
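
Because every ancestor of a matching tag also contains the same text, a loop over all <div> tags reports the outer <div> as well as the inner one. The following is a minimal sketch of one way to keep only the innermost match, run against the unmodified HTML from the question. The helper name innermost_matches is hypothetical and not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

html = """
<body>
<div>
<div>
<p>This 19 is A BIG test</p>
<p>This is another test</p>
<p>19 that is yet <em>another</em> great <strong>test</strong> with a CAR</p>
</div>
<div>
<p>This 17 is A BIG test</p>
<p>This is another greaterly test</p>
<p>17738 that is yet <em>another</em>  <strong>test</strong> with a CAR</p>
</div>
</div>
</body>
"""


def innermost_matches(soup, terms):
    """Hypothetical helper: return tags whose text contains every term
    while none of their direct child tags does."""

    def has_all(tag):
        text = tag.get_text()
        return all(term in text for term in terms)

    matches = []
    for tag in soup.find_all(True):  # True matches every tag
        children = tag.find_all(True, recursive=False)  # direct child tags only
        if has_all(tag) and not any(has_all(child) for child in children):
            matches.append(tag)
    return matches


soup = BeautifulSoup(html, "html.parser")
found = innermost_matches(soup, ["17738", "CAR", "greaterly"])
for tag in found:
    print(tag)
```

Checking only direct children is enough here: if a deeper descendant contained all terms, the direct child that encloses it would contain them too.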

You can use the CSS selector div:has(> p), which matches <div> tags that have <p> tags directly under them.

For example:

from bs4 import BeautifulSoup
text = """
<body>
<div>
<div>
<p>This 19 is A BIG test</p>
<p>This is another test</p>
<p>19 that is yet <em>another</em> great <strong>test</strong> with a CAR</p>
</div>
<div>
<p>This 17 is A BIG test</p>
<p>This is another greaterly test</p>
<p>17738 that is yet <em>another</em>  <strong>test</strong> with a CAR</p>
</div>
</div>
</body>"""

to_search = ['17738', 'CAR', 'greaterly']
soup = BeautifulSoup(text, 'html.parser')
results = []
for div in soup.select('div:has(> p)'):  # search only divs that have <p> tags DIRECTLY under them
    if all(word in div.text for word in to_search):
        results.append(div)
print(results)

Prints:

[<div>
<p>This 17 is A BIG test</p>
<p>This is another greaterly test</p>
<p>17738 that is yet <em>another</em>  <strong>test</strong> with a CAR</p>
</div>]
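
If the search terms really need to be regular expressions rather than fixed words, the same selector can be combined with compiled patterns. This is only a sketch: the three patterns below are illustrative assumptions (word boundaries around the number and around CAR, and a prefix match for "greater"), not patterns taken from the question.

```python
import re
from bs4 import BeautifulSoup

html = """
<body>
<div>
<div>
<p>This 19 is A BIG test</p>
<p>This is another test</p>
<p>19 that is yet <em>another</em> great <strong>test</strong> with a CAR</p>
</div>
<div>
<p>This 17 is A BIG test</p>
<p>This is another greaterly test</p>
<p>17738 that is yet <em>another</em>  <strong>test</strong> with a CAR</p>
</div>
</div>
</body>
"""

# Illustrative patterns (assumptions, not from the original post)
patterns = [
    re.compile(r"\b17738\b"),
    re.compile(r"\bCAR\b"),
    re.compile(r"greater\w*"),
]

soup = BeautifulSoup(html, "html.parser")
results = []
for div in soup.select("div:has(> p)"):  # only divs with <p> directly under them
    text = div.get_text()
    # keep the div only if every pattern matches somewhere in its text
    if all(pattern.search(text) for pattern in patterns):
        results.append(div)

print(results)
```

Compiling the patterns once outside the loop avoids recompiling them for every <div>.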

Another approach.

from simplified_scrapy import SimplifiedDoc
html =  """
<body>
<div>
<div>
<p>This 19 is A BIG test</p>
<p>This is another test</p>
<p>19 that is yet <em>another</em> great <strong>test</strong> with a CAR</p>
</div>
<div>
<p>This 17 is A BIG test</p>
<p>This is another greaterly test</p>
<p>17738 that is yet <em>another</em>  <strong>test</strong> with a CAR</p>
</div>
</div>
</body>
"""
regex1 = r"17738"
regex2 = r"CAR"
regex3 = r"greaterly"
doc = SimplifiedDoc(html)
p3s = doc.getElementsByReg(regex3,tag='p')
for p in p3s:
    p2 = p.getNext('p')
    if p2.contains([regex1,regex2],attr='html'):
        # print (p2.outerHtml)
        print (p2.parent.outerHtml) # Get div
        break

Result:

<div>
<p>This 17 is A BIG test</p>
<p>This is another greaterly test</p>
<p>17738 that is yet <em>another</em>  <strong>test</strong> with a CAR</p>
</div>

More examples are available here: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
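
For comparison, the same sibling walk (find the <p> matching the third pattern, then check the next <p> for the other two) can be sketched with BeautifulSoup alone. This is an assumption-laden equivalent, not a translation of the simplified_scrapy API. Note that passing string= together with a tag name only matches tags that have a single text child, which holds for the "greaterly" paragraph here.

```python
import re
from bs4 import BeautifulSoup

html = """
<body>
<div>
<div>
<p>This 19 is A BIG test</p>
<p>This is another test</p>
<p>19 that is yet <em>another</em> great <strong>test</strong> with a CAR</p>
</div>
<div>
<p>This 17 is A BIG test</p>
<p>This is another greaterly test</p>
<p>17738 that is yet <em>another</em>  <strong>test</strong> with a CAR</p>
</div>
</div>
</body>
"""

regex1 = r"17738"
regex2 = r"CAR"
regex3 = r"greaterly"

soup = BeautifulSoup(html, "html.parser")
match = None
# <p> tags whose single text child matches regex3
for p in soup.find_all("p", string=re.compile(regex3)):
    nxt = p.find_next_sibling("p")  # the following <p> in the same parent, if any
    if nxt is not None and all(re.search(rx, nxt.get_text()) for rx in (regex1, regex2)):
        match = p.parent  # the enclosing <div>
        break

print(match)
```

Like the snippet above, this relies on the paragraphs appearing in a fixed order, so it is less general than checking the text of the whole <div>.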
