BeautifulSoup cannot retrieve all the data

I am trying to retrieve all of a Reddit user's comments with BeautifulSoup. Here is the code:

from urllib.request import urlopen as ureq
from bs4 import BeautifulSoup as soup
url = "https://www.reddit.com/user/IHateTheLetterF/"
client = ureq(url)
page_html = client.read()

pagesoup = soup(page_html, "html5lib")
comments = pagesoup.findAll("p",{"class":"_1qeIAgB0cPwnLhDF9XSiJM"})

It does retrieve some comments, but for some reason it only retrieves 16 of them, while the user clearly has far more than 16 comments. I have tried different parsers such as lxml, html.parser and html5lib, but they all retrieve only 16 comments. Strangely, the exact same code retrieved 22 comments yesterday. Any help would be appreciated.

This is because the page loads the comment content dynamically with JavaScript, so you will not be able to get it all with ureq.
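As a quick check you can count how many comment nodes are present in the raw HTML that urlopen actually returns; this is a minimal sketch reusing the generated class name from the question (it may change over time), and anything Reddit loads later with JavaScript is missing from this count:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# fetch the HTML exactly as urlopen sees it - no JavaScript is executed
html = urlopen("https://www.reddit.com/user/IHateTheLetterF/").read()
soup = BeautifulSoup(html, "html.parser")

# only the comments rendered into the initial page show up here
static_comments = soup.find_all("p", {"class": "_1qeIAgB0cPwnLhDF9XSiJM"})
print(len(static_comments))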

Instead, you should use selenium together with a webdriver so that all of the comments are loaded before you scrape them.

You can download the ChromeDriver executable here. If you put it in the same folder as your script, you can run the following:

Edit: use a custom scroll to force the page to keep loading new comments

import os
import time
from selenium import webdriver
from bs4 import BeautifulSoup

# configure driver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome_driver = os.path.join(os.getcwd(), "chromedriver.exe")
# executable_path works with Selenium 3; Selenium 4 expects a Service object instead
driver = webdriver.Chrome(options=chrome_options, executable_path=chrome_driver)

url = "https://www.reddit.com/user/IHateTheLetterF/comments/"
driver.get(url)

scroll_pause_time = 3  # You can try your own pause time.
screen_height = driver.execute_script("return window.screen.height;")  # get screen height
i = 1

while True:
    # scroll down one screen height at a time
    driver.execute_script("window.scrollTo(0, {screen_height}*{i});".format(screen_height=screen_height, i=i))
    i += 1  # times looped
    time.sleep(scroll_pause_time)  # wait for the new content to load
    scroll_height = driver.execute_script("return document.body.scrollHeight;")  # check current scroll height

    # Check your content
    soup = BeautifulSoup(driver.page_source, "html.parser")
    comments = soup.findAll("p", {"class": "_1qeIAgB0cPwnLhDF9XSiJM"})
    print(len(comments))

    # Finish the loop when you can't scroll any further
    if screen_height * i > scroll_height:
        break
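Once the loop exits, the comments variable still holds bs4 tags and the headless browser is still running, so as a minimal follow-up (assuming the variables from the script above) you can extract the text and close the driver:

# the scrolling loop above has finished at this point
comment_texts = [c.get_text() for c in comments]  # plain text of each scraped comment
driver.quit()  # shut down the headless browser
print(comment_texts[:5])  # first few comments as a sanity check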

I let it run for a few minutes and it had already collected more than 200 comments.

Although this will probably work, I would suggest looking for a proper API for this.
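For example, here is a minimal sketch using PRAW (the Python Reddit API Wrapper); the client_id, client_secret and user_agent values are placeholders that you obtain by registering a script app in your Reddit preferences:

# pip install praw
import praw

# NOTE: placeholder credentials - create a "script" app at
# https://www.reddit.com/prefs/apps to obtain real values
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="comment-scraper by u/yourusername",
)

# iterate over the user's newest comments; limit=None pages through
# everything the API will return
for comment in reddit.redditor("IHateTheLetterF").comments.new(limit=None):
    print(comment.body)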
