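I am trying to scrape comments from team-bhp.com. However, I noticed that every user comment has its own div id.
The XPath has the form:
//*[@id="post_message_4655182"]
The HTML has the form:
<div id="post_message_4655182">
I am open to using any library (such as bs4 or lxml), but I would prefer to stay in Python. My code: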
import requests
from lxml import html
url = 'https://www.team-bhp.com/forum/luxury-imports-niche/213083-looking-buying-bmw-x1-need-advice.html'
path = '//*[@id="post_message_4657893"]'
response = requests.get(url)
byte_data = response.content
source_code = html.fromstring(byte_data)
tree = source_code.xpath(path)
print(tree[0].text_content())
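This gives the expected output, for example: Hi Hajaar, we recently completed a deal on a BMW X1. Here are a few things I would like to share: bargaining is hard
But here I have hard-coded one specific comment id. How can I extract all comments from a page?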
Adjust the XPath to use starts-with(), so that it matches every element whose id begins with the shared prefix:
path = '//*[starts-with(@id,"post_message_")]'
Example:
import requests
from lxml import html
url = 'https://www.team-bhp.com/forum/luxury-imports-niche/213083-looking-buying-bmw-x1-need-advice.html'
path = '//*[starts-with(@id,"post_message_")]'
source_code = html.fromstring(requests.get(url).content)
# Print the text of every element whose id starts with "post_message_"
for e in source_code.xpath(path):
    print(e.text_content())
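If you also want to know which post each comment belongs to, the numeric part of the id can be kept alongside the text. The following is only a minimal sketch: SAMPLE is a made-up fragment that imitates the forum markup (it is not copied from the site), and extract_comments and PREFIX are hypothetical names introduced for illustration. It runs without any network access.

```python
from lxml import html

# Hypothetical fragment that mimics the forum markup; not real site content.
SAMPLE = """
<html><body>
  <div id="post_message_101">First comment</div>
  <div id="post_message_102">Second <b>comment</b></div>
  <div id="postmenu_101">Not a comment</div>
</body></html>
"""

PREFIX = "post_message_"


def extract_comments(page_source):
    """Return {post_id: text} for every element whose id starts with PREFIX."""
    tree = html.fromstring(page_source)
    comments = {}
    for element in tree.xpath('//*[starts-with(@id, "post_message_")]'):
        # Drop the shared prefix so only the numeric part of the id remains.
        post_id = element.get("id")[len(PREFIX):]
        comments[post_id] = element.text_content().strip()
    return comments


comments = extract_comments(SAMPLE)
for post_id, text in comments.items():
    print(post_id, text)
```

With a real page you would pass requests.get(url).content to extract_comments instead of SAMPLE.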
Or alternatively with BeautifulSoup:
import requests
from bs4 import BeautifulSoup
url = 'https://www.team-bhp.com/forum/luxury-imports-niche/213083-looking-buying-bmw-x1-need-advice.html'
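# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# The CSS attribute selector ^= matches ids that start with the given prefix
for e in soup.select('[id^="post_message_"]'):
    print(e.get_text())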
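Forum posts often contain line breaks and indentation, so the raw text can come out with a lot of stray whitespace. As a rough sketch of one way to tidy it up, the example below collapses every run of whitespace into a single space. SAMPLE is an invented fragment and extract_texts is a hypothetical helper name; neither comes from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the forum markup; not real site content.
SAMPLE = """
<div id="post_message_101">
  Line one<br>
  Line two
</div>
<div id="post_message_102">Another <i>comment</i></div>
"""


def extract_texts(page_source):
    """Return the whitespace-normalised text of every matching comment."""
    soup = BeautifulSoup(page_source, "html.parser")
    return [
        # split() with no arguments splits on any whitespace run,
        # so joining with a single space collapses newlines and indents.
        " ".join(tag.get_text().split())
        for tag in soup.select('[id^="post_message_"]')
    ]


texts = extract_texts(SAMPLE)
for text in texts:
    print(text)
```

Note that this also flattens paragraph breaks inside a post, so it is only suitable if you do not need to preserve them.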