Beautiful Soup only extracts the first 10 elements

I am trying to extract information from the Volkswagen page on kununu, for example the "Pro" entries.

import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.kununu.com/de/volkswagen/kommentare'
page = requests.get(url)
soup = bs(page.text, 'html.parser')
divs = soup.find_all(class_="col-xs-12 col-lg-12")
# print the paragraph that follows each "Pro" heading
for h2 in soup.find_all('h2', class_='h3', text=['Pro']):
    print(h2.find_next_sibling('p').get_text())

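But the output contains only the first 10 "Pro" entries. It looks like the page shows only the first 10 comments by default, while all the hidden comments sit under the "col-xs-12 col-lg-12" class... or maybe I am missing something. Can you help me extract all the data, not just the first 10?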

You can load these comments by mimicking the XHR requests that the browser sends to load more comments dynamically.

Working code (note: it uses f-strings, so Python 3.6+ is required; on an earlier Python version, use .format() instead):

from bs4 import BeautifulSoup
import requests

comments = []
with requests.Session() as session:
    session.headers = {
        'x-requested-with': 'XMLHttpRequest'
    }
    page = 1
    while True:
        print(f"Processing page {page}..")
        url = f'https://www.kununu.com/de/volkswagen/kommentare/{page}'
        response = session.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        new_comments = [
            pro.find_next_sibling('p').get_text()
            for pro in soup.find_all('h2', text='Pro')
        ]
        if not new_comments:
            print(f"No more comments. Page: {page}")
            break
        comments += new_comments
        # just to see current progress so far
        print(comments)
        print(len(comments))
        page += 1
print(comments)

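Note how we instantiate and use a requests.Session() object: it reuses the underlying connection, which gives a performance benefit when sending multiple requests to the same host.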
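
The "keep requesting the next page until one comes back empty" pattern can also be pulled out into a small helper, which makes it testable without touching the network. This is only a sketch: `collect_all_pages`, `fake_fetch`, `fake_site`, and the `max_pages` safety cap are hypothetical names that are not part of the answer above. In real use, `fetch_page` would wrap the `session.get(...)` call and the BeautifulSoup parsing shown earlier.

```python
from typing import Callable, List


def collect_all_pages(
    fetch_page: Callable[[int], List[str]],
    start_page: int = 1,
    max_pages: int = 1000,
) -> List[str]:
    """Call fetch_page(start_page), fetch_page(start_page + 1), ...
    and stop at the first page that returns no items."""
    collected: List[str] = []
    # max_pages is an assumed safety cap so a misbehaving site
    # cannot keep the loop running forever.
    for page in range(start_page, start_page + max_pages):
        items = fetch_page(page)
        if not items:
            break
        collected.extend(items)
    return collected


# Stand-in for the real request + parsing step (made-up data).
fake_site = {1: ["pro A", "pro B"], 2: ["pro C"]}


def fake_fetch(page: int) -> List[str]:
    # Pages that do not exist behave like an empty result.
    return fake_site.get(page, [])


all_comments = collect_all_pages(fake_fetch)
print(all_comments)
```

Passing the fetch step in as a function keeps the pagination logic separate from the site-specific URL and selectors, so either part can change without affecting the other.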
