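I am trying to search an HTML document for 3 (or more) specific regular expressions. The HTML files all have different forms and layouts, but they contain specific words, so I can search for those words.
Now, I want to return this block: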
<div>
<p>This 17 is A BIG test</p>
<p>This is another greaterly test</p>
<p>17738 that is yet <em>another</em> <strong>test</strong> with a CAR</p>
</div>
I have tried many versions of the code, but at the moment I am stumbling around in the dark.
import re
from bs4 import Tag, BeautifulSoup
text = """
<body>
<div>
<div>
<p>This 19 is A BIG test</p>
<p>This is another test</p>
<p>19 that is yet <em>another</em> great <strong>test</strong> with a CAR</p>
</div>
<div>
<p>This 17 is A BIG test</p>
<p>This is another greaterly test</p>
<p>17738 that is yet <em>another</em> <strong>test</strong> with a CAR</p>
</div>
</div>
</body>
"""
def searchme(bstag):
    print("searchme")
    regex1 = r"17738"
    regex2 = r"CAR"
    regex3 = r"greaterly"
    switch1 = 0
    switch2 = 0
    switch3 = 0
    result1 = bstag.find(string=re.compile(regex1, re.MULTILINE))
    if len(result1) >= 1:
        switch1 = 1
    result2 = result1.parent.find(string=re.compile(regex2, re.MULTILINE))
    if len(result2) >= 1:
        switch2 = 1
    result3 = result2.parent.find_all(string=re.compile(regex3, re.MULTILINE))
    if len(result3) >= 1:
        switch3 = 1
    if switch1 == 1 and switch2 == 1 and switch3 == 1:
        return bstag
    else:
        if bstag.parent is not None:
            searchme(bstag.parent)
        else:
            searchme(result1.parent)
soup = BeautifulSoup(text, 'html.parser')
el = searchme(soup)
print(el)
Edit 1
Updated the desired return code.
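I'm not sure I understand the example, because the text object contains no element that includes all 3 regex terms.
However, if I have parsed the question correctly, I would suggest not using regex for this task (it is suboptimal in terms of computation time and overhead) and relying on the simpler "in" operator instead. Below is an MWE in which I slightly modified the text from the original example so that it includes the line you are interested in.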
from bs4 import BeautifulSoup
text = """
<body>
<div>
<div>
<p>This 19 is A BIG test</p>
<p>This is another test</p>
<p>19 that is yet <em>another</em> great <strong>test</strong> with a CAR</p>
</div>
<div>
<p>This 17 is A BIG test</p>
<p>This is another greaterly test</p>
<p>17738 that is yet <em>another</em> greaterly <strong>test</strong> with a CAR</p>
</div>
</div>
</body>
"""
t1 = '17738' # terms to be searched
t2 = 'CAR'
t3 = 'greaterly'
soup = BeautifulSoup(text, 'html.parser')
for row in soup.find_all('div'):  # iterate over every <div>, outer ones included
    if t1 in row.text and t2 in row.text and t3 in row.text:  # the <div> text contains all terms
        print(row.text)
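One thing to watch for with the loop above: an outer <div> also contains all three terms in its text, so every enclosing <div> is reported along with the one you actually want. Below is a minimal sketch of one way to keep only the innermost match, assuming the goal is the smallest <div> that contains every term. It uses the sample HTML from the question, and the helper names has_all_terms and innermost_matches are made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<body>
<div>
<div>
<p>This 19 is A BIG test</p>
<p>This is another test</p>
<p>19 that is yet <em>another</em> great <strong>test</strong> with a CAR</p>
</div>
<div>
<p>This 17 is A BIG test</p>
<p>This is another greaterly test</p>
<p>17738 that is yet <em>another</em> <strong>test</strong> with a CAR</p>
</div>
</div>
</body>
"""

terms = ["17738", "CAR", "greaterly"]


def has_all_terms(tag, terms):
    """Return True if the tag's text contains every term."""
    text = tag.get_text()
    return all(term in text for term in terms)


def innermost_matches(soup, terms, name="div"):
    """Hypothetical helper: matching tags with no matching descendant of the same name."""
    matches = [tag for tag in soup.find_all(name) if has_all_terms(tag, terms)]
    return [
        tag for tag in matches
        if not any(has_all_terms(child, terms) for child in tag.find_all(name))
    ]


soup = BeautifulSoup(html, "html.parser")
found = innermost_matches(soup, terms)
for div in found:
    print(div)
```

The second pass costs an extra scan of each candidate's descendants, which should be fine for small documents; for large pages it may be worth narrowing the candidates first.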
You can use the CSS selector div:has(> p), which matches <div> tags that have <p> tags directly under them.
For example:
from bs4 import BeautifulSoup
text = """
<body>
<div>
<div>
<p>This 19 is A BIG test</p>
<p>This is another test</p>
<p>19 that is yet <em>another</em> great <strong>test</strong> with a CAR</p>
</div>
<div>
<p>This 17 is A BIG test</p>
<p>This is another greaterly test</p>
<p>17738 that is yet <em>another</em> <strong>test</strong> with a CAR</p>
</div>
</div>
</body>"""
to_search = ['17738', 'CAR', 'greaterly']
soup = BeautifulSoup(text, 'html.parser')
results = []
for div in soup.select('div:has(> p)'):  # search only divs that have <p> tags DIRECTLY under them
    if all(word in div.text for word in to_search):
        results.append(div)
print(results)
Prints:
[<div>
<p>This 17 is A BIG test</p>
<p>This is another greaterly test</p>
<p>17738 that is yet <em>another</em> <strong>test</strong> with a CAR</p>
</div>]
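The question asks about regular expressions, while the example above matches literal words. If real patterns are needed, the same all(...) test can be kept and only the membership check swapped for re.search. A small sketch under that assumption follows; matches_all and the word-boundary patterns are illustrative and not part of the original answer, and the two strings stand in for the text of the two inner <div> blocks.

```python
import re

# Illustrative patterns: \b keeps "17738" from matching inside a longer number.
patterns = [re.compile(p) for p in (r"\b17738\b", r"\bCAR\b", r"greaterly")]


def matches_all(text, patterns):
    """Return True if every compiled pattern is found somewhere in text."""
    return all(p.search(text) for p in patterns)


# Plain-text stand-ins for the text of the two inner <div> blocks.
first_block = (
    "This 19 is A BIG test\n"
    "This is another test\n"
    "19 that is yet another great test with a CAR"
)
second_block = (
    "This 17 is A BIG test\n"
    "This is another greaterly test\n"
    "17738 that is yet another test with a CAR"
)

print(matches_all(first_block, patterns))   # False
print(matches_all(second_block, patterns))  # True
```

In the loop above, the condition would then become matches_all(div.text, patterns).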
Another approach:
from simplified_scrapy import SimplifiedDoc
html = """
<body>
<div>
<div>
<p>This 19 is A BIG test</p>
<p>This is another test</p>
<p>19 that is yet <em>another</em> great <strong>test</strong> with a CAR</p>
</div>
<div>
<p>This 17 is A BIG test</p>
<p>This is another greaterly test</p>
<p>17738 that is yet <em>another</em> <strong>test</strong> with a CAR</p>
</div>
</div>
</body>
"""
regex1 = r"17738"
regex2 = r"CAR"
regex3 = r"greaterly"
doc = SimplifiedDoc(html)
p3s = doc.getElementsByReg(regex3,tag='p')
for p in p3s:
    p2 = p.getNext('p')
    if p2.contains([regex1, regex2], attr='html'):
        # print(p2.outerHtml)
        print(p2.parent.outerHtml)  # Get div
        break
Result:
<div>
<p>This 17 is A BIG test</p>
<p>This is another greaterly test</p>
<p>17738 that is yet <em>another</em> <strong>test</strong> with a CAR</p>
</div>
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples