How to select the previous tag when a string is found with re

I have an HTML file like the following (with more than 100 records):

<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">John Smith</h3>
<span class="light-text">Center - VAR - Employee I</span>
</div>
<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">Jenna Smith</h3>
<span class="light-text">West - VAR - Employee I</span>
</div>
<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">Jordan Smith</h3>
<span class="light-text">East - VAR - Employee II</span>
</div>

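I need to extract the name only if the person is an "Employee I", and I am finding that challenging. How can I select a tag whose next tag contains "Employee I"? Or should I use a different approach? Is it possible to use a condition in this case?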

import re

with open("file.html", 'r') as fp:
    html = fp.read()
print(re.search(r'\bEmployee I\b', html).group(0))

For example, how can I tell it to read the previous tag?

import re
from bs4 import BeautifulSoup

with open('inputfile.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp.read(), 'html.parser')

# For each span whose text ends in "Employee I", take the h3 inside the same parent div.
names = [span.parent.find('h3').string
         for span in
         soup.find_all('span',
                       class_='light-text',
                       string=re.compile('Employee I$'))
         ]
print(names)

which gives

['John Smith', 'Jenna Smith']

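For clarity, I have spread the list comprehension over several lines, which makes it easier to see where to adjust it for other use cases. A regular for loop that appends to a list works just as well, of course; I just like list comprehensions.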
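For comparison, here is a rough sketch of the plain for loop version. It assumes the same markup as in the question, pasted in as a string so the snippet runs on its own; the variable names and the `is not None` guard are my additions, not part of the original answer.

```python
import re
from bs4 import BeautifulSoup

# Sample markup copied from the question; a real script would read the file instead.
html = '''<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">John Smith</h3>
<span class="light-text">Center - VAR - Employee I</span>
</div>
<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">Jenna Smith</h3>
<span class="light-text">West - VAR - Employee I</span>
</div>
<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">Jordan Smith</h3>
<span class="light-text">East - VAR - Employee II</span>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

names = []
# Visit every span whose text ends in "Employee I".
for span in soup.find_all('span', string=re.compile(r'Employee I$')):
    # The name lives in the h3 that shares the span's parent div.
    heading = span.parent.find('h3')
    if heading is not None:  # skip records that have no h3
        names.append(heading.get_text(strip=True))

print(names)
```

With the three sample records this should print the same two names as the list comprehension.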

re.compile('Employee I$') is needed to avoid matching 'Employee II'. The class_ argument is extra and may not be needed.
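To see why the anchor matters, here is a small standard-library check. The `titles` list just repeats the three span texts from the question, and the names `loose` and `anchored` are only for illustration.

```python
import re

# The three span texts from the sample HTML.
titles = [
    "Center - VAR - Employee I",
    "West - VAR - Employee I",
    "East - VAR - Employee II",
]

loose = re.compile(r"Employee I")      # no anchor: also matches inside "Employee II"
anchored = re.compile(r"Employee I$")  # "$" requires the match to end the string

loose_hits = [t for t in titles if loose.search(t)]
anchored_hits = [t for t in titles if anchored.search(t)]

print(len(loose_hits), len(anchored_hits))
```

The unanchored pattern matches all three texts, while the anchored one matches only the two "Employee I" records. If the real span text can carry trailing spaces, a pattern such as r'Employee I\s*$' might be needed instead.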

The rest is pretty much self-explanatory, especially with the BeautifulSoup documentation alongside.

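Note that string used to be called text, in case you are using an older version of BeautifulSoup: before version 4.4.0, the string argument of find_all was named text.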

from bs4 import BeautifulSoup
test = '''<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">John Smith</h3>
<span class="light-text">Center - VAR - Employee I</span>
</div>
<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">Jenna Smith</h3>
<span class="light-text">West - VAR - Employee I</span>
</div>
<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">Jordan Smith</h3>
<span class="light-text">East - VAR - Employee II</span>
</div>'''
soup = BeautifulSoup(test, 'html.parser')
for person in soup.find_all('div'):
    name = person.find('h3').text
    # The span text looks like "Center - VAR - Employee I"; the third field is the role.
    employee_nb = person.find('span').text.split('-')[2].strip()
    if employee_nb == "Employee I":
        print(name)
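
The question also asks how to read the previous tag directly. One possible sketch, assuming the h3 and the span are always siblings inside the same div as in the sample, is to start from the span and step back with find_previous_sibling. The variable names and the use of rsplit are my own choices here, not part of either answer above.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = '''<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">John Smith</h3>
<span class="light-text">Center - VAR - Employee I</span>
</div>
<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">Jenna Smith</h3>
<span class="light-text">West - VAR - Employee I</span>
</div>
<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">Jordan Smith</h3>
<span class="light-text">East - VAR - Employee II</span>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

names = []
for span in soup.find_all('span'):
    # Take the text after the last "-" as the role, e.g. "Employee I".
    role = span.get_text().rsplit('-', 1)[-1].strip()
    if role == 'Employee I':
        # Step back to the nearest earlier sibling that is an h3.
        heading = span.find_previous_sibling('h3')
        if heading is not None:
            names.append(heading.get_text(strip=True))

print(names)
```

Because the role is compared with ==, "Employee II" is excluded without needing a regular expression.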
