BeautifulSoup cannot retrieve all the data

I am trying to retrieve all of a Reddit user's comments with BeautifulSoup. Here is the code:

from urllib.request import urlopen as ureq
from bs4 import BeautifulSoup as soup
url = "https://www.reddit.com/user/IHateTheLetterF/"
client = ureq(url)
page_html = client.read()

pagesoup = soup(page_html, "html5lib")
comments = pagesoup.findAll("p",{"class":"_1qeIAgB0cPwnLhDF9XSiJM"})

It does retrieve some comments, but for some reason it only retrieves 16 of them, while the user clearly has far more than 16 comments. I have tried different parsers such as lxml, html.parser and html5lib, but they all retrieve only 16 comments. Strangely, the exact same code retrieved 22 comments yesterday. Any help would be appreciated.

This is because the page loads the comment content dynamically with JavaScript, so you will not be able to get it all with ureq.
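As a quick check you can count how many comment nodes are present in the raw HTML that urlopen actually returns; this is a minimal sketch reusing the generated class name from the question (it may change over time), and anything Reddit loads later with JavaScript is missing from this count:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# fetch the HTML exactly as urlopen sees it - no JavaScript is executed
html = urlopen("https://www.reddit.com/user/IHateTheLetterF/").read()
soup = BeautifulSoup(html, "html.parser")

# only the comments rendered into the initial page show up here
static_comments = soup.find_all("p", {"class": "_1qeIAgB0cPwnLhDF9XSiJM"})
print(len(static_comments))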

Instead, you should use selenium together with a webdriver so that all of the comments are loaded before you scrape them.

You can download the ChromeDriver executable here. If you put it in the same folder as your script, you can run the following:

Edit: use a custom scroll to force the page to keep loading new comments

import os
import time
from selenium import webdriver
from bs4 import BeautifulSoup

# configure driver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome_driver = os.path.join(os.getcwd(), "chromedriver.exe")
# executable_path works with Selenium 3; Selenium 4 expects a Service object instead
driver = webdriver.Chrome(options=chrome_options, executable_path=chrome_driver)

url = "https://www.reddit.com/user/IHateTheLetterF/comments/"
driver.get(url)

scroll_pause_time = 3  # You can try your own pause time.
screen_height = driver.execute_script("return window.screen.height;")  # get screen height
i = 1

while True:
    # scroll down one screen height at a time
    driver.execute_script("window.scrollTo(0, {screen_height}*{i});".format(screen_height=screen_height, i=i))
    i += 1  # times looped
    time.sleep(scroll_pause_time)  # wait for the new content to load
    scroll_height = driver.execute_script("return document.body.scrollHeight;")  # check current scroll height

    # Check your content
    soup = BeautifulSoup(driver.page_source, "html.parser")
    comments = soup.findAll("p", {"class": "_1qeIAgB0cPwnLhDF9XSiJM"})
    print(len(comments))

    # Finish the loop when you can't scroll any further
    if screen_height * i > scroll_height:
        break
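Once the loop exits, the comments variable still holds bs4 tags and the headless browser is still running, so as a minimal follow-up (assuming the variables from the script above) you can extract the text and close the driver:

# the scrolling loop above has finished at this point
comment_texts = [c.get_text() for c in comments]  # plain text of each scraped comment
driver.quit()  # shut down the headless browser
print(comment_texts[:5])  # first few comments as a sanity check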

I let it run for a few minutes and it had already collected more than 200 comments.

Although this will probably work, I would suggest looking for a proper API for this.
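For example, here is a minimal sketch using PRAW (the Python Reddit API Wrapper); the client_id, client_secret and user_agent values are placeholders that you obtain by registering a script app in your Reddit preferences:

# pip install praw
import praw

# NOTE: placeholder credentials - create a "script" app at
# https://www.reddit.com/prefs/apps to obtain real values
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="comment-scraper by u/yourusername",
)

# iterate over the user's newest comments; limit=None pages through
# everything the API will return
for comment in reddit.redditor("IHateTheLetterF").comments.new(limit=None):
    print(comment.body)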
