Python: extracting specific values with bs4



HTML:

<div class="col-7"> 
<dl class="row box">
<h2>GENERAL</h2>
<dt class="col-6">transmission:</dt>
<dd class="col-6">sequential automatic</dd>
<dt class="col-6 grey">number of seats:</dt>
<dd class="col-6">5</dd>
<dt class="col-6">first year of production:</dt>
<dd class="col-6">2017</dd>
<dt class="col-6 grey">last year of production:</dt>
<dd class="col-6">available</dd>
</dl>
<dl class="row box">
<h2>DRIVE</h2>
<dt class="col-6">fuel:</dt>
<dd class="col-6">petrol</dd>
<dt class="col-6 grey">total maximum power:</dt>
<dd class="col-6">147 kW (200 hp)</dd>
<dt class="col-6">total maximum torque:</dt>
<dd class="col-6">330 Nm</dd>
</dl>
<dl class="row box">
<h2>TRANSMISSION</h2>
<dt class="col-6">1st gear:</dt>
<dd class="col-6">5,00:1</dd>
<dt class="col-6 grey">2nd gear:</dt>
<dd class="col-6">3,20:1</dd>
</dl>
</div>

My code:

for item2 in soup2.find_all(attrs={'class': 'col-7'}):
    jj = item2.text

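jj ends up holding every value from the page I am scraping, but I only need a few of them. For example, I only need the number of seats and the last year of production from GENERAL, and the 1st gear value from TRANSMISSION.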

The result should be:

5, available, 5,00:1

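The information you need is simply the item that follows each of the titles "number of seats", "last year of production" and "1st gear", so you can use zip to loop over each item together with the one after it: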

all_items = soup.find_all(attrs={'class': 'col-6'})
titles = [
    "number of seats",
    "last year of production",
    "1st gear"
]
d = {title: [] for title in titles}
for item, next_item in zip(all_items, all_items[1:]):
    for title in titles:
        if title in item.text:
            d[title].append(next_item.text)
            break

After that, d contains all the information you need, as a list of matched values per title.
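To get from d to the single comma-separated line asked for in the question, one option is to take the first match for each title and join them. The sketch below is self-contained and assumes a trimmed copy of the HTML from the question stored in a variable named html, parsed with the built-in 'html.parser'; the names html and result are only for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumed sample data).
html = """
<div class="col-7">
<dl class="row box">
<h2>GENERAL</h2>
<dt class="col-6">transmission:</dt>
<dd class="col-6">sequential automatic</dd>
<dt class="col-6 grey">number of seats:</dt>
<dd class="col-6">5</dd>
<dt class="col-6 grey">last year of production:</dt>
<dd class="col-6">available</dd>
</dl>
<dl class="row box">
<h2>TRANSMISSION</h2>
<dt class="col-6">1st gear:</dt>
<dd class="col-6">5,00:1</dd>
<dt class="col-6 grey">2nd gear:</dt>
<dd class="col-6">3,20:1</dd>
</dl>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Every <dt> and <dd> carries the class "col-6", so they come back interleaved.
all_items = soup.find_all(attrs={"class": "col-6"})
titles = ["number of seats", "last year of production", "1st gear"]

d = {title: [] for title in titles}
# Pair each element with its successor; a matching <dt> is followed by its <dd>.
for item, next_item in zip(all_items, all_items[1:]):
    for title in titles:
        if title in item.text:
            d[title].append(next_item.text)
            break

# Take the first value found for each title, in the order the titles were listed.
result = ", ".join(d[title][0] for title in titles)
print(result)
```

Note that d[title][0] raises an IndexError if a title was never found, so real pages may need a check for empty lists first.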

Change the find_values tuple to choose which values to pull out of the HTML text:

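from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
# Labels to look for, lower-cased to match the normalised <dt> text.
find_values = ('number of seats', 'last year of production', '1st gear')
for i in soup.find_all(attrs={'class': 'row box'}):
    for j in i.find_all('dt'):
        text = j.get_text().lower().strip()
        # str.startswith accepts a tuple and matches any of its entries.
        if text.startswith(find_values):
            print(text, j.find_next_sibling('dd').get_text())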
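Since the question asks for values from particular sections (GENERAL and TRANSMISSION), the same find_next_sibling idea can be keyed on the h2 heading of each dl as well. This is only a sketch: the function name extract_by_section, the wanted mapping and the sample html are made up for illustration, and it assumes 'html.parser', which leaves the h2 inside its dl as written.

```python
from bs4 import BeautifulSoup

def extract_by_section(html, wanted):
    """Return {(section, label): value} for the labels requested per section.

    `wanted` maps a section heading (the <h2> text) to a set of lower-case
    labels without the trailing colon.
    """
    soup = BeautifulSoup(html, "html.parser")
    found = {}
    for dl in soup.find_all("dl"):
        heading = dl.find("h2")
        if heading is None:
            continue  # skip lists that have no section heading
        section = heading.get_text(strip=True)
        labels = wanted.get(section, ())
        for dt in dl.find_all("dt"):
            # Normalise "Number of seats:" to "number of seats".
            label = dt.get_text(strip=True).rstrip(":").lower()
            if label in labels:
                dd = dt.find_next_sibling("dd")
                if dd is not None:
                    found[(section, label)] = dd.get_text(strip=True)
    return found

# Assumed sample data, trimmed from the markup in the question.
html = """
<div class="col-7">
<dl class="row box">
<h2>GENERAL</h2>
<dt class="col-6 grey">number of seats:</dt>
<dd class="col-6">5</dd>
<dt class="col-6 grey">last year of production:</dt>
<dd class="col-6">available</dd>
</dl>
<dl class="row box">
<h2>TRANSMISSION</h2>
<dt class="col-6">1st gear:</dt>
<dd class="col-6">5,00:1</dd>
<dt class="col-6 grey">2nd gear:</dt>
<dd class="col-6">3,20:1</dd>
</dl>
</div>
"""

wanted = {
    "GENERAL": {"number of seats", "last year of production"},
    "TRANSMISSION": {"1st gear"},
}
found = extract_by_section(html, wanted)
print(", ".join(found.values()))
```

Keying on the heading avoids picking up a label that happens to appear in more than one section, at the cost of depending on the exact h2 text.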
