I am trying to extract strings from an HTML file using BeautifulSoup. The result of one query contains label tags. How can I get rid of these tags?
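from bs4 import BeautifulSoup
import requests

# Parse the local HTML file; the body of the with block must be indented
with open('/Desktop/filename.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

# Find the first div whose class attribute is exactly "col-sm-8 col-xs-6"
string = soup.find('div', class_="col-sm-8 col-xs-6")
print(string)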
Output:
<div class="col-sm-8 col-xs-6">
Sherlock Holmes <br>
<label for="AgentAddress" style="display: none;">
Detective's Address
</label>
221B Baker Street London <br>
<label for="AgentCityStateZip" style="display: none;">
City, State, Zip
</label>
London, United Kingdom
</div>
print(string.text)
Output:
Sherlock Holmes
Detective's Address
221B Baker Street London
City, State, Zip
London, United Kingdom
I am not interested in the text inside the <label></label> tags. How can I get rid of them so that the output is:
Sherlock Holmes
221B Baker Street London
London, United Kingdom
You can try decompose(), which removes a tag and its contents from the tree. For example, run this before printing:
for label_element in string.find_all("label"):
    label_element.decompose()
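To check the decompose() approach end to end, here is a minimal sketch that parses the div from the question as an inline string instead of reading it from a file. The helper name strip_labels is hypothetical, and the sketch assumes the built-in "html.parser" backend rather than lxml so that it needs nothing beyond bs4. It also uses stripped_strings instead of .text, on the assumption that the blank lines left behind by the removed labels are unwanted.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question's output.
HTML = """<div class="col-sm-8 col-xs-6">
Sherlock Holmes <br>
<label for="AgentAddress" style="display: none;">
Detective's Address
</label>
221B Baker Street London <br>
<label for="AgentCityStateZip" style="display: none;">
City, State, Zip
</label>
London, United Kingdom
</div>"""


def strip_labels(html):
    """Hypothetical helper: return the div's text lines with <label> text removed."""
    # Assumption: "html.parser" is used here instead of lxml to avoid an extra dependency.
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_="col-sm-8 col-xs-6")
    # decompose() removes each label and its contents from the parse tree.
    for label_element in div.find_all("label"):
        label_element.decompose()
    # stripped_strings skips whitespace-only strings and trims the rest.
    return list(div.stripped_strings)


lines = strip_labels(HTML)
print("\n".join(lines))
```

If the removed labels are needed later, extract() is the alternative to consider, since it removes the tag from the tree but returns it instead of destroying it.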