How to print the last message of a Reddit chat group using Selenium



So I get the message from this line:

<pre class="_3Gy8WZD53wWAE41lr57by3 ">Sleep</pre>

My code:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

# Raw string so the backslashes in the Windows path are not read as escape sequences
PATH = r'C:\Users\User\Desktop\chromedriver.exe'
driver = webdriver.Chrome(PATH)
driver.get('https://www.reddit.com')
time.sleep(80)  # time to log in
search = driver.find_element_by_class_name('_3Gy8WZD53wWAE41lr57by3')
print(driver.find_element_by_xpath(".//pre").text)  # *LET'S CALL THIS 'S'*

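Everything works fine. When I print s, it prints the last message of the chat.

Note that whenever someone enters a message, it ends up under the class "_3Gy8WZD53wWAE41lr57by3".

My goal is to print the first message in that chat.

I had to edit this twice because I made some mistakes.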

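I would suggest two changes to your code that will save you a lot of frustration:

  1. Avoid explicit sleep calls and wait for the element to be present instead. Your program then waits only as long as the page you are loading actually needs.
  2. Use CSS selectors instead of XPath. You get finer control over which elements you access, and your code becomes more robust and flexible.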

In practice, it looks like this:

Wait up to 80 seconds for the login:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# Get the page; the user now needs to log in
driver.get('https://www.reddit.com')
# Wait up to 80 seconds for a chat message element to appear
try:
    element = WebDriverWait(driver, 80).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "pre._3Gy8WZD53wWAE41lr57by3"))
    )
except TimeoutException:
    print("You didn't log in, shutting down program")
    driver.quit()
    raise SystemExit(1)  # stop here so the code below does not run on a closed driver
# continue as normal here
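
For intuition on why an explicit wait beats a fixed sleep, here is a rough pure-Python sketch of the polling idea. It is not Selenium's implementation: `wait_until`, `condition`, and `poll_interval` are hypothetical names chosen for illustration, and the 0.5 second default is an assumption that mirrors WebDriverWait's default poll frequency.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(condition: Callable[[], Optional[T]],
               timeout: float,
               poll_interval: float = 0.5) -> T:
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value as soon as it appears, or raises TimeoutError
    once `timeout` seconds have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # done early: no need to wait out the full timeout
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

With a timeout of 80, this returns the moment the condition holds, whereas `time.sleep(80)` always blocks for the full 80 seconds, even if you logged in after 10.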

Use CSS selectors to find the messages:

from selenium.webdriver.common.by import By

# I personally like to always use the plural form of this function
# since, if nothing matches, it returns an empty list. The singular form
# raises an error if no results are found
# NOTE: this uses Reddit's class for messages; you may need to change the CSS selector
# NOTE: find_elements(By.CSS_SELECTOR, ...) works in Selenium 3 and 4; the older
# find_elements_by_css_selector helper was removed in recent Selenium 4 releases
all_messages = driver.find_elements(By.CSS_SELECTOR, 'pre._3Gy8WZD53wWAE41lr57by3')
# You can now access the first and last elements (None if the list is empty):
first_message = all_messages[0].text if all_messages else None
last_message = all_messages[-1].text if all_messages else None
# Alternatively, if you are concerned about memory usage from potentially large
# lists of messages, use the CSS pseudo-classes ':first-of-type' and ':last-of-type'
# NOTE: these match relative to sibling elements, so this only picks out a single
# message when all the <pre> messages share the same parent element
# NOTE: indexing only when the list is non-empty gives None if no results are found
first_message = driver.find_elements(By.CSS_SELECTOR, 'pre._3Gy8WZD53wWAE41lr57by3:first-of-type')
first_message = first_message[0] if first_message else None
last_message = driver.find_elements(By.CSS_SELECTOR, 'pre._3Gy8WZD53wWAE41lr57by3:last-of-type')
last_message = last_message[0] if last_message else None
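
If you end up reading the first and last message in several places, a small helper keeps the empty-list check in one spot. This is only a sketch: `first_and_last_text` and `FakeElement` are hypothetical names, and `FakeElement` merely stands in for a Selenium element (anything with a `.text` attribute) so the snippet runs without a browser.

```python
from typing import Optional, Sequence, Tuple


class FakeElement:
    """Hypothetical stand-in for a Selenium element; only `.text` is used."""

    def __init__(self, text: str) -> None:
        self.text = text


def first_and_last_text(elements: Sequence) -> Tuple[Optional[str], Optional[str]]:
    """Return the text of the first and last element.

    Returns (None, None) when the list is empty, which is what the plural
    find_elements call gives you when nothing matches.
    """
    if not elements:
        return None, None
    return elements[0].text, elements[-1].text


if __name__ == "__main__":
    # Stand-in data; with Selenium you would pass the list from find_elements
    messages = [FakeElement("Hi"), FakeElement("How are you?"), FakeElement("Sleep")]
    first, last = first_and_last_text(messages)
    print(first, "|", last)
```

In the real script you would pass in the list returned by the plural find call in place of the stand-in objects.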

I hope this gives you an immediate solution, as well as some fundamentals on how to optimize your web scraping.
