I am trying to extract information from an HTML table (found on this example page: https://www.detrasdelafachada.com/house-for-sale-marianao-havana-cuba/dcyktckvwjxhpl9):
<div class="row">
<div class="col-label">
Type of property:
</div>
<div class="col-datos">
Apartment </div>
</div>
<div class="row">
<div class="col-label">
Building style:
</div>
<div class="col-datos">
50 year </div>
</div>
<div class="row">
<div class="col-label precio">
Sale price:
</div>
<div class="col-datos precio">
12 000 CUC </div>
</div>
<div class="row">
<div class="col-label">
Rooms:
</div>
<div class="col-datos">
1 </div>
</div>
<div class="row">
<div class="col-label">
Bathrooms:
</div>
<div class="col-datos">
1 </div>
</div>
<div class="row">
<div class="col-label">
Kitchens:
</div>
<div class="col-datos">
1 </div>
</div>
<div class="row">
<div class="col-label">
Surface:
</div>
<div class="col-datos">
38 mts2 </div>
</div>
<div class="row">
<div class="col-label">
Year of construction:
</div>
<div class="col-datos">
1945 </div>
</div>
<div class="row">
<div class="col-label">
Building style:
</div>
<div class="col-datos">
50 year </div>
</div>
<div class="row">
<div class="col-label">
Construction type:
</div>
<div class="col-datos">
Masonry and plate </div>
</div>
<div class="row">
<div class="col-label">
Home conditions:
</div>
<div class="col-datos">
Good </div>
</div>
<div class="row">
<div class="col-label">
Other peculiarities:
</div>
</div>
<div class="row">
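Using Beautiful Soup, how can I find the value of "Building style:" (and of the other entries)?
My problem is that I cannot select an entry directly by class, because all entries in the table share the same div class names.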
You can iterate over each row div and collect the text of the divs nested inside it:
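from bs4 import BeautifulSoup as soup
import re
# content is assumed to hold the HTML source of the page
d = soup(content, 'html.parser')
# for every row, strip newlines and runs of whitespace from each nested div's text
results = [[re.sub(r'\s{2,}|\n+', '', i.text) for i in b.find_all('div')] for b in d.find_all('div', {'class':'row'})]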
Output:
[['Type of property:', 'Apartment '], ['Building style:', '50 year '], ['Sale price:', '12 000 CUC '], ['Rooms:', '1 '], ['Bathrooms:', '1 '], ['Kitchens:', '1 '], ['Surface:', '38 mts2 '], ['Year of construction:', '1945 '], ['Building style:', '50 year '], ['Construction type:', 'Masonry and plate '], ['Home conditions:', 'Good '], ['Other peculiarities:'], []]
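To look up a single entry such as "Building style", it may be more convenient to turn the rows into a dictionary. The following is only a sketch under some assumptions: `SAMPLE_HTML` is a shortened copy of the markup from the question, and `rows_to_dict` is a hypothetical helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question, used as sample input.
SAMPLE_HTML = """
<div class="row">
<div class="col-label">
Type of property:
</div>
<div class="col-datos">
Apartment </div>
</div>
<div class="row">
<div class="col-label">
Building style:
</div>
<div class="col-datos">
50 year </div>
</div>
<div class="row">
<div class="col-label precio">
Sale price:
</div>
<div class="col-datos precio">
12 000 CUC </div>
</div>
<div class="row">
<div class="col-label">
Other peculiarities:
</div>
</div>
"""


def rows_to_dict(html):
    """Map each label (without its trailing colon) to its value, or None."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for row in soup.find_all("div", class_="row"):
        # class_ matches any one of an element's classes,
        # so "col-label precio" is found as well.
        label = row.find("div", class_="col-label")
        if label is None:
            continue  # skip rows without a label, e.g. an empty trailing row
        value = row.find("div", class_="col-datos")
        key = label.get_text(strip=True).rstrip(":")
        result[key] = value.get_text(strip=True) if value is not None else None
    return result


details = rows_to_dict(SAMPLE_HTML)
print(details["Building style"])
```

Note that on the full page "Building style:" appears twice; a dictionary keeps only the last occurrence, which is harmless here because both rows carry the same value.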
If you know that you specifically want the string "Building style:", you can grab the text of its .next_sibling, or simply use find_next:
>>> from bs4 import BeautifulSoup
>>> html = "<c><div>hello</div> <div>hi</div></c>"
>>> soup = BeautifulSoup(html, 'html.parser')
>>> print(soup.find(string="hello").find_next('div').contents[0])
hi
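Applied to the markup from the question, the same idea needs one adjustment: the label text there is surrounded by newlines, so a plain `string="Building style:"` lookup would not match the text node exactly. A minimal sketch, assuming a one-row excerpt of the question's HTML and using a regular expression for the label:

```python
import re
from bs4 import BeautifulSoup

# One row copied from the question's markup.
html = """
<div class="row">
<div class="col-label">
Building style:
</div>
<div class="col-datos">
50 year </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The text node includes the surrounding newlines, so search with a
# regex instead of an exact string.
label_text = soup.find(string=re.compile(r"Building style:"))

# Walk forward in the document to the next value cell.
value_div = label_text.find_next("div", class_="col-datos")
building_style = value_div.get_text(strip=True)
print(building_style)
```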
If you want all of them, you can use .find_all to get every div tag with the class "row", and then grab the child divs of each one.
data = []
soup = BeautifulSoup(html, 'html.parser')
for row in soup.find_all('div', class_="row"):
    # collect the stripped text of every div nested inside this row
    rowdata = [c.text.strip() for c in row.find_all('div')]
    data.append(rowdata)
print(data)
# Outputs the nested list:
# [['Type of property:', 'Apartment'], ['Building style:', '50 year'], ...]
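A variation on the loop above, offered only as a sketch: CSS selectors via `select` and `select_one` can pick the label and value cells by class, and rows that lack either cell (such as "Other peculiarities:" or the unclosed trailing row in the question) can be skipped. The `html` string is a small assumed excerpt of the question's markup, and `pairs` is an illustrative name.

```python
from bs4 import BeautifulSoup

# Small excerpt of the question's markup, including an incomplete row
# and the unclosed trailing row.
html = """
<div class="row">
<div class="col-label">
Rooms:
</div>
<div class="col-datos">
1 </div>
</div>
<div class="row">
<div class="col-label">
Other peculiarities:
</div>
</div>
<div class="row">
"""

soup = BeautifulSoup(html, "html.parser")

pairs = []
for row in soup.select("div.row"):
    label = row.select_one("div.col-label")
    value = row.select_one("div.col-datos")
    # Keep only rows that have both a label cell and a value cell.
    if label is not None and value is not None:
        pairs.append((label.get_text(strip=True), value.get_text(strip=True)))

print(pairs)
```

Keeping the result as a list of tuples, rather than a dictionary, preserves repeated labels in page order.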