How to scrape user comments when each comment has its own div id



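I am trying to scrape comments from team-bhp.com. However, I noticed that each user comment has its own div id:

  1. The XPath has the form: //*[@id="post_message_4655182"]
  2. The HTML has the form: <div id="post_message_4655182">

I am open to using any library (such as bs4 or lxml), but I would prefer Python. My code: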
import requests
from lxml import html

url = 'https://www.team-bhp.com/forum/luxury-imports-niche/213083-looking-buying-bmw-x1-need-advice.html'
# Hard-coded id of one specific comment
path = '//*[@id="post_message_4657893"]'
response = requests.get(url)
byte_data = response.content
source_code = html.fromstring(byte_data)
# xpath() returns a list of matching elements
tree = source_code.xpath(path)
print(tree[0].text_content())

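This gives the expected output, for example: Hello Hajaar, we recently closed a deal on a BMW X1. Here are a few things I would like to share: Bargaining is hard
But here I have hard-coded the id of one specific comment. How can I extract all the comments from a page?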

Adjust your XPath and use starts-with() so that it matches every id beginning with the shared prefix:

path = '//*[starts-with(@id,"post_message_")]'

Example:

import requests
from lxml import html

url = 'https://www.team-bhp.com/forum/luxury-imports-niche/213083-looking-buying-bmw-x1-need-advice.html'
# Match every element whose id begins with "post_message_"
path = '//*[starts-with(@id,"post_message_")]'
source_code = html.fromstring(requests.get(url).content)
for e in source_code.xpath(path):
    print(e.text_content())

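Or, since the question is also tagged BeautifulSoup, with a CSS attribute selector:

import requests
from bs4 import BeautifulSoup

url = 'https://www.team-bhp.com/forum/luxury-imports-niche/213083-looking-buying-bmw-x1-need-advice.html'
# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# [id^="..."] selects elements whose id starts with the given prefix
for e in soup.select('[id^="post_message_"]'):
    print(e.get_text())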
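
If you want to check the prefix-matching idea offline, without requesting the site, here is a minimal sketch that uses only the standard library's html.parser rather than lxml or bs4. The class name PostCollector and the sample markup are made up for illustration; the real page is more complex and may need extra handling.

```python
from html.parser import HTMLParser


class PostCollector(HTMLParser):
    """Collect the text of every element whose id starts with a prefix.

    Hypothetical helper for illustration, not part of lxml or bs4.
    """

    # Void elements have no closing tag, so they must not change the depth.
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self, prefix="post_message_"):
        super().__init__()
        self.prefix = prefix
        self.posts = {}        # maps element id -> collected text
        self._current = None   # id of the post currently being read
        self._depth = 0        # nesting depth inside the current post

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        if self._current is not None:
            # Nested element inside a post: just track the depth.
            self._depth += 1
            return
        elem_id = dict(attrs).get("id") or ""
        if elem_id.startswith(self.prefix):
            self._current = elem_id
            self._depth = 1
            self.posts[elem_id] = ""

    def handle_endtag(self, tag):
        if tag in self.VOID or self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:
            # The element that opened the post has been closed.
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.posts[self._current] += data


# Made-up sample markup that mimics the id pattern from the question.
sample = (
    '<div id="post_message_1">Hello <b>Hajaar</b>,<br>first post</div>'
    '<div id="sidebar">ignore me</div>'
    '<div id="post_message_2">Second post</div>'
)

collector = PostCollector()
collector.feed(sample)
collector.close()
posts = {post_id: text.strip() for post_id, text in collector.posts.items()}

for post_id, text in posts.items():
    print(post_id, "->", text)
```

Keeping the id next to the text makes it easier to refer back to a specific comment later. With lxml the same information should be available through e.get('id') inside the loop.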
