I'm trying to extract the text from all h3 and h4 tags on a page and save it to a CSV file.
Sample:
<div class="vc_column-inner">
  <div class="wpb_wrapper">
    <div class="wpb_text_column wpb_content_element ">
      <div class="wpb_wrapper">
        <h4>service text</h4>
      </div>
    </div>
    <div class="wpb_text_column wpb_content_element ">
      <div class="wpb_wrapper">
        <h3 style="color: #2ac4ea; font-size: 35px;">2.900</h3>
      </div>
    </div>
  </div>
</div>
My code:
import pandas as pd
import requests
from bs4 import BeautifulSoup

service = []
price = []
url = 'www.site.com'  # placeholder; requests needs a full URL including the scheme
reqs = requests.get(url)
soup = BeautifulSoup(reqs.content, 'html.parser')
for div in soup.findAll(class_='row'):
    for div1 in div.findAll(class_='vc_column-inner'):
        services = div1.find('h4')
        prices = div1.find('h3')
        service.append(services)  # appends the Tag object, markup included
        price.append(prices)
df = pd.DataFrame({'service': service, 'price': price})
df.to_csv('results.csv', index=False, encoding='utf-8')
Result:
service,price
<h4>service text</h4>,"<h3 style=""color: #2ac4ea; font-size: 35px;"">2.900</h3>"
What I need:
service,price
service text,2.900
Is this possible with the approach above? Thanks.
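find() returns Tag objects, so appending them directly writes the full markup to the CSV. When appending the services and prices variables to the lists, call the .get_text() method to keep only the text: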
service.append(services.get_text(strip=True))
price.append(prices.get_text(strip=True))
The result will then be:
service,price
service text,2.900
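To see the difference in isolation, here is a minimal sketch built on the sample markup from the question. The inline html string is an assumption that stands in for the real page, so the snippet runs without a network request. It compares the string form of a Tag with what .get_text(strip=True) returns.

```python
from bs4 import BeautifulSoup

# Inline copy of the sample markup (assumption: stands in for the real page).
html = """
<div class="vc_column-inner">
  <div class="wpb_wrapper">
    <h4>service text</h4>
    <h3 style="color: #2ac4ea; font-size: 35px;">2.900</h3>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
h4 = soup.find("h4")  # a bs4 Tag object, not a plain string
h3 = soup.find("h3")

as_markup = str(h4)                     # the whole element, tags included
service_text = h4.get_text(strip=True)  # inner text, surrounding whitespace trimmed
price_text = h3.get_text(strip=True)

print(as_markup)
print(service_text, price_text)
```

Writing the Tag itself into the DataFrame stores the markup form, which is why the original CSV contained the h3 and h4 tags.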
Code:
import pandas as pd
import requests
from bs4 import BeautifulSoup

service = []
price = []
url = 'www.site.com'  # placeholder; requests needs a full URL including the scheme
reqs = requests.get(url)
soup = BeautifulSoup(reqs.content, 'html.parser')
for div in soup.findAll(class_='row'):
    for div1 in div.findAll(class_='vc_column-inner'):
        services = div1.find('h4')
        prices = div1.find('h3')
        service.append(services.get_text(strip=True))  # <-- .get_text()
        price.append(prices.get_text(strip=True))  # <-- .get_text()
df = pd.DataFrame({'service': service, 'price': price})
df.to_csv('results.csv', index=False, encoding='utf-8')
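One caveat with the loop above: find() returns None when a column has no h4 or no h3, and calling .get_text() on None raises AttributeError. The following is a hedged sketch of one way to guard against that. The helper name extract_rows and the inline sample string are hypothetical, and skipping incomplete columns is an assumption about the desired behaviour, not something the question specifies.

```python
from bs4 import BeautifulSoup


def extract_rows(html):
    """Return (service, price) text pairs from columns that contain both tags.

    Hypothetical helper for illustration; not part of the original answer.
    """
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for column in soup.find_all(class_="vc_column-inner"):
        h4 = column.find("h4")
        h3 = column.find("h3")
        # find() returns None when the tag is absent; skip incomplete columns.
        if h4 is None or h3 is None:
            continue
        rows.append((h4.get_text(strip=True), h3.get_text(strip=True)))
    return rows


# Assumed test markup: one complete column and one without a price.
sample = """
<div class="vc_column-inner"><h4>service text</h4><h3>2.900</h3></div>
<div class="vc_column-inner"><h4>no price here</h4></div>
"""

rows = extract_rows(sample)
print(rows)
```

The resulting pairs can be passed to pd.DataFrame(rows, columns=['service', 'price']) and then written with to_csv as in the code above.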