How to scrape user comments when each comment has its own div id

I am trying to scrape comments from team-bhp.com. However, I noticed that each user comment has its own div id.

The XPath has the form: //*[@id="post_message_4655182"]
The HTML has the form: <div id="post_message_4655182">
I am open to using any library (such as bs4 or lxml), but I would prefer Python. My code:

import requests
from lxml import html

url = 'https://www.team-bhp.com/forum/luxury-imports-niche/213083-looking-buying-bmw-x1-need-advice.html'
path = '//*[@id="post_message_4657893"]'  # hard-coded id of one comment
response = requests.get(url)
byte_data = response.content
source_code = html.fromstring(byte_data)
tree = source_code.xpath(path)
print(tree[0].text_content())

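This gives the expected output, for example: Hi Hajaar, we recently concluded a deal for a BMW X1. Here are a few things I would like to share: bargaining is hard
But here I have hard-coded the id of one specific comment. How can I extract all the comments from a page?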

Adjust the XPath and use starts-with() so that it matches every element whose id begins with post_message_:

path = '//*[starts-with(@id,"post_message_")]'

Example

import requests
from lxml import html

url = 'https://www.team-bhp.com/forum/luxury-imports-niche/213083-looking-buying-bmw-x1-need-advice.html'
path = '//*[starts-with(@id,"post_message_")]'
source_code = html.fromstring(requests.get(url).content)
# each matched element is one comment
for e in source_code.xpath(path):
    print(e.text_content())

Or, since the question is also tagged BeautifulSoup:

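import requests
from bs4 import BeautifulSoup

url = 'https://www.team-bhp.com/forum/luxury-imports-niche/213083-looking-buying-bmw-x1-need-advice.html'
# name the parser explicitly; bs4 warns when it has to guess one
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# the CSS attribute selector [id^="..."] matches ids that start with the prefix
for e in soup.select('[id^="post_message_"]'):
    print(e.get_text())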
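
If you also need to know which comment each text belongs to, the same starts-with() XPath can feed a dictionary keyed by post id. The following is only a minimal sketch: the SAMPLE markup, the PREFIX constant and the extract_comments helper are hypothetical names made up for illustration, and the inline HTML stands in for the real page so the snippet runs offline. On the live site you would pass requests.get(url).content instead.

```python
from lxml import html

# Hypothetical sample markup imitating the forum's structure (an assumption);
# the real page would be fetched with requests instead.
SAMPLE = """
<html><body>
  <div id="post_message_101">First <b>comment</b></div>
  <div id="post_message_102">Second comment</div>
  <div id="sidebar">Not a comment</div>
</body></html>
"""

PREFIX = "post_message_"


def extract_comments(page_source):
    """Return {post_id: comment_text} for every element whose id starts with PREFIX."""
    tree = html.fromstring(page_source)
    comments = {}
    for element in tree.xpath(f'//*[starts-with(@id, "{PREFIX}")]'):
        # strip the shared prefix so only the numeric part of the id remains
        post_id = element.get("id")[len(PREFIX):]
        comments[post_id] = element.text_content().strip()
    return comments


comments = extract_comments(SAMPLE)
for post_id, text in comments.items():
    print(post_id, text)
```

Keeping the id next to the text makes it easier to deduplicate comments or link back to a specific post later.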

