Extract text based on the HTML font-family



I have some HTML data and I only want to extract the text that appears in a bold font.

<span style="font-family: ABCDEE+Cambria,Bold; font-size:9px">Pinecone Functions 
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:419px; top:1903px; width:76px; height:11px;"><span style="font-family: ABCDEE+Calibri,Bold; font-size:7px">Trainee Sign-Off 
<br></span></div>

I only want the text whose font-family is ABCDEE+Cambria,Bold.

import re
from bs4 import BeautifulSoup

with open('/home/output4.html') as file:
    text = file.read()
soup = BeautifulSoup(text, 'html.parser')
x = soup.find_all('span', style=re.compile(r'font-family: ABCDEE+Cambria,Bold.*'))
for rows in x:
    print(rows.text)

I tried this but got an empty list.

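+ is a special character in regular expressions (a quantifier meaning "one or more of the preceding item"), so you should escape it (note \+ instead of +).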
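To see why the unescaped pattern finds nothing, here is a small check using only the standard re module on the style string from the question. The variable names are just for illustration.

```python
import re

# The style attribute of the span we want to match (taken from the question).
style = "font-family: ABCDEE+Cambria,Bold; font-size:9px"

# Unescaped: "E+" means "one or more E", so the pattern expects "Cambria"
# right after the run of E's and never accounts for the literal "+".
unescaped = re.compile(r'font-family: ABCDEE+Cambria,Bold')

# Escaped: "\+" matches a literal plus sign.
escaped = re.compile(r'font-family: ABCDEE\+Cambria,Bold')

print(unescaped.search(style))            # None
print(escaped.search(style) is not None)  # True
```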

Example:

from bs4 import BeautifulSoup
import re
text = """
<span style="font-family: ABCDEE+Cambria,Bold; font-size:9px">Pinecone Functions 
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:419px; top:1903px; width:76px; height:11px;"><span style="font-family: ABCDEE+Calibri,Bold; font-size:7px">Trainee Sign-Off 
<br></span></div>
"""
soup = BeautifulSoup(text, 'html.parser')
x = soup.find_all('span', style=re.compile(r'font-family: ABCDEE\+Cambria,Bold.*'))
for rows in x:
    print(rows.text)

Output:

Pinecone Functions
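If the font name comes from a variable, or you would rather not escape it by hand, re.escape can probably do it for you. The following is only a sketch: the helper name texts_with_font is made up for this example, and the HTML is a trimmed version of the snippet from the question.

```python
import re
from bs4 import BeautifulSoup

# Trimmed version of the HTML from the question.
html = """
<div><span style="font-family: ABCDEE+Cambria,Bold; font-size:9px">Pinecone Functions
<br></span></div><div><span style="font-family: ABCDEE+Calibri,Bold; font-size:7px">Trainee Sign-Off
<br></span></div>
"""

def texts_with_font(html, font_family):
    """Return the text of every <span> whose style names font_family.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    # re.escape turns "+" (and any other regex metacharacter) into a literal.
    pattern = re.compile(r"font-family:\s*" + re.escape(font_family))
    spans = soup.find_all("span", style=pattern)
    return [span.get_text(strip=True) for span in spans]

print(texts_with_font(html, "ABCDEE+Cambria,Bold"))  # ['Pinecone Functions']
```

The \s* after the colon is an extra assumption so the pattern also tolerates styles written without a space after "font-family:".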
