How to get find_all results as text/string when web scraping - Python web scraping question



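I want to scrape a website. Can you tell me how to get only the text as output, in this format: "BEV, Enyaq Coupé iV vRS, Skoda, UK, Volkswagen"? At the moment my output also includes the HTML tags etc.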

Thanks for your input!

from bs4 import BeautifulSoup
import requests
import csv
source = requests.get('https://www.electrive.com/2022/02/13/skoda-reveals-uk-pricing-for-enyaq-coupe-iv-vrs/').text
soup = BeautifulSoup(source, 'lxml')
article = soup.find()
tags2 = article.find_all('div', class_='tags')
print (tags2)

Output:

[<div class="tags">
<a href="https://www.electrive.com/tag/bev/" rel="tag">BEV</a><a href="https://www.electrive.com/tag/enyaq-coupe-iv-vrs/" rel="tag">Enyaq Coupé iV vRS</a><a href="https://www.electrive.com/tag/skoda/" rel="tag">Skoda</a><a href="https://www.electrive.com/tag/uk/" rel="tag">UK</a><a href="https://www.electrive.com/tag/volkswagen/" rel="tag">Volkswagen</a> </div>]
[Finished in 580ms]

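You have to select your elements more specifically, because the information you want is inside the <a> elements. find_all() returns a ResultSet of tags, so iterate over it and take the text of each one, e.g. with a list comprehension: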

tags2 = [e.text for e in soup.find('div', class_='tags').find_all('a')]

Alternatively, use CSS selectors:

tags2 = [e.text for e in soup.select('div.tags a')]
#output
['BEV', 'Enyaq Coupé iV vRS', 'Skoda', 'UK', 'Volkswagen']
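
To try both variants without requesting the page, here is a minimal sketch that parses the markup copied from the question's output. It assumes the built-in 'html.parser' instead of 'lxml' (so no extra install is needed) and uses get_text(strip=True) in place of .text to drop any surrounding whitespace. The names via_find and via_select are only illustrative.

```python
from bs4 import BeautifulSoup

# Markup copied from the output shown in the question, so no network request is needed.
html = '''<div class="tags">
<a href="https://www.electrive.com/tag/bev/" rel="tag">BEV</a><a href="https://www.electrive.com/tag/enyaq-coupe-iv-vrs/" rel="tag">Enyaq Coupé iV vRS</a><a href="https://www.electrive.com/tag/skoda/" rel="tag">Skoda</a><a href="https://www.electrive.com/tag/uk/" rel="tag">UK</a><a href="https://www.electrive.com/tag/volkswagen/" rel="tag">Volkswagen</a> </div>'''

# Assumption: the stdlib parser is used here instead of 'lxml'.
soup = BeautifulSoup(html, 'html.parser')

# Variant 1: find the container, then collect the text of every <a> inside it.
via_find = [a.get_text(strip=True) for a in soup.find('div', class_='tags').find_all('a')]

# Variant 2: the same selection written as a CSS selector.
via_select = [a.get_text(strip=True) for a in soup.select('div.tags a')]

print(via_find)
print(via_find == via_select)
```

Both lists should contain the same five tag names, so which variant to use is mostly a matter of taste.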

If you want a single string rather than a list, just join() the elements:

tags2 = ','.join([e.text for e in soup.find('div', class_='tags').find_all('a')])
#output
BEV,Enyaq Coupé iV vRS,Skoda,UK,Volkswagen
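
One thing to watch: soup.find() returns None when the page has no matching div, and chaining .find_all() onto that raises an AttributeError. Below is a rough sketch of a guard for that case. The helper name tag_texts and the shortened sample markup are made up for illustration. Since the question imports csv, the sketch also shows one possible way to write the tags as a CSV row, here into an in-memory buffer rather than a file.

```python
import csv
import io

from bs4 import BeautifulSoup


def tag_texts(html):
    """Return the text of every <a> inside <div class="tags">, or [] if the div is absent.

    Hypothetical helper for illustration; it is not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, 'html.parser')  # assumption: stdlib parser instead of 'lxml'
    container = soup.find('div', class_='tags')
    if container is None:
        # find() returns None when nothing matches, so stop before calling find_all().
        return []
    return [a.get_text(strip=True) for a in container.find_all('a')]


# Shortened sample markup (an assumption, not the full page).
sample = '<div class="tags"><a rel="tag">BEV</a><a rel="tag">Skoda</a><a rel="tag">UK</a></div>'

print(', '.join(tag_texts(sample)))      # BEV, Skoda, UK
print(tag_texts('<p>no tags here</p>'))  # []

# Write the tags as a single CSV row into an in-memory buffer.
buffer = io.StringIO()
csv.writer(buffer).writerow(tag_texts(sample))
print(buffer.getvalue().strip())         # BEV,Skoda,UK
```

Joining with ', ' instead of ',' gives the comma-plus-space format asked for in the question.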
