How do I scrape only image posts from 9gag?



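I want to scrape the first image post and blacklist its URL, so that the next search skips URLs that were already used and picks the next image post. I tried to find the first image, but it doesn't work.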

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://9gag.com/funny')
time.sleep(2)
# Dismiss the cookie consent dialog
driver.find_element(By.XPATH, value='//*[@id="qc-cmp2-ui"]/div[2]/div/button[1]/span').click()
time.sleep(2)
# Grab the first image inside an image post
gagpost = driver.find_element(By.CSS_SELECTOR,value=".image-post img")
gagpostsurl = gagpost.get_attribute('src')
gagposttitle = gagpost.get_attribute('alt')
print(gagpostsurl)
print(gagposttitle)

Error:

Traceback (most recent call last):
  File "C:\Users\klaus\PycharmProjects\testTEST\main.py", line ..., in <module>
    gagposttitle = gagpost.find_element(By,value='img').get_attribute('alt')
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\webelement.py", line 763, in find_element
    return self._execute(Command.FIND_CHILD_ELEMENT,
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\webelement.py", line 740, in _execute
    return self._parent.execute(command, params)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 428, in execute
    response = self.command_executor.execute(driver_command, params)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\remote_connection.py", line 345, in execute
    data = utils.dump_json(params)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\utils.py", line 23, in dump_json
    return json.dumps(json_struct)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\json\__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type type is not JSON serializable

Process finished with exit code 1

I also tried this; sometimes it works and sometimes it doesn't.

driver = webdriver.Chrome()
driver.get('https://9gag.com/funny')
time.sleep(2)
driver.find_element(By.XPATH, value='//*[@id="qc-cmp2-ui"]/div[2]/div/button[1]/span').click()
time.sleep(2)
gagpost = driver.find_element(By.CSS_SELECTOR,value=".image-post img")
gagpostsurl = gagpost.get_attribute('src')
gagposttitle = gagpost.get_attribute('alt')
print(gagpostsurl)
print(gagposttitle)

I would appreciate your help.

You can do it like this:

from selenium.common.exceptions import NoSuchElementException
...
# Get the feed element
feed = driver.find_element(By.CSS_SELECTOR, "div.main-wrap section#list-view-2")
# Get the streams from the feed
streams = feed.find_elements(By.CLASS_NAME, "list-stream")
# Debug number of streams
print(f"Streams: {len(streams)}")
# Iterate over each stream
for stream in streams:
    # Find articles within the stream; these are the 'posts'
    articles = stream.find_elements(By.TAG_NAME, "article")
    # Debug number of articles
    print(f"Articles: {len(articles)}")
    # Iterate over each article
    for article in articles:
        # Try/except here because some articles are adverts, these are skipped
        try:
            # Find the article title
            title = article.find_element(By.CSS_SELECTOR, "header > a")
        except NoSuchElementException:
            continue
        # Print the article title
        print(f"Title: {title.text}")

This prints:

Streams: 1
Articles: 3
Title: Hahahahaha Git Gud
Title: How to impress your guests

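This does not print every post on the page because the posts are lazy-loaded: they are fetched from the server as you scroll. To load them, you need to add scrolling to the code above. Fortunately, the Python Selenium documentation has an example for exactly this case. You can also refer to my previous answer to see what an implementation looks like.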
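As a rough illustration of the scrolling step, here is a minimal sketch. The helper name `scroll_until_loaded` and the `pause` and `max_scrolls` parameters are my own assumptions, not part of Selenium; the only Selenium call used is `driver.execute_script`. Because a feed like this can keep loading indefinitely, the sketch caps the number of scrolls.

```python
import time


def scroll_until_loaded(driver, pause=2.0, max_scrolls=10):
    """Scroll to the bottom repeatedly so lazy-loaded posts get fetched.

    Stops when the page height no longer grows or after max_scrolls
    scrolls, whichever comes first. Returns the number of scrolls done.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    while scrolls < max_scrolls:
        # Jump to the current bottom of the page to trigger the next load
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        # Give the page time to fetch and render the new posts
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Nothing new was appended, so stop scrolling
            break
        last_height = new_height
    return scrolls
```

You would call `scroll_until_loaded(driver)` before collecting the streams, so that more `article` elements exist by the time the loops run. The fixed pause is a simplification; it may need tuning for a slow connection.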

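I have only added enough code to get the titles; you can extract the rest of the information you need from the article variable inside the nested loop.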
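For the "first image post that has not been used yet" part of the question, something along these lines might work. This is only a sketch: `first_new_image` and `seen_urls` are hypothetical names, and it assumes image posts still match the `.image-post img` selector from the question. It uses `find_elements`, which returns an empty list instead of raising when nothing matches, so adverts and non-image posts are skipped.

```python
# Same string value as selenium's By.CSS_SELECTOR; in a real script you
# would import By and pass By.CSS_SELECTOR instead.
CSS_SELECTOR = "css selector"


def first_new_image(articles, seen_urls):
    """Return (src, alt) of the first image post not yet in seen_urls.

    The returned URL is added to seen_urls so the next call skips it.
    Returns None when no new image post is found.
    """
    for article in articles:
        # Empty list for adverts and posts without an image
        images = article.find_elements(CSS_SELECTOR, ".image-post img")
        if not images:
            continue
        src = images[0].get_attribute("src")
        if not src or src in seen_urls:
            # Missing URL or already used: try the next article
            continue
        seen_urls.add(src)
        return src, images[0].get_attribute("alt")
    return None


# Hypothetical usage with the `articles` list from the loop above:
# seen_urls = set()
# result = first_new_image(articles, seen_urls)
```

Note that the locator strategy has to be one of the `By` constants, such as `By.CSS_SELECTOR` or `By.TAG_NAME`. Passing the `By` class itself, as in the `find_element(By, value='img')` call shown in the traceback, is what leads to "Object of type type is not JSON serializable". To keep the blacklist between runs, `seen_urls` would have to be saved somewhere, for example in a text file.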