So I am getting the message from this line:
<pre class="_3Gy8WZD53wWAE41lr57by3 ">Sleep</pre>
My code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
PATH = r'C:\Users\User\Desktop\chromedriver.exe'  # raw string, so backslashes are not read as escape sequences
driver = webdriver.Chrome(PATH)
driver.get('https://www.reddit.com')
time.sleep(80)  # time to log in
search = driver.find_element_by_class_name('_3Gy8WZD53wWAE41lr57by3 ')
print(driver.find_element_by_xpath(".//pre").text) # *LET'S CALL THIS 'S'*
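Everything works fine. When I print S, it prints the last message from the chat.
Note that every time someone types a message, it ends up under the same class: "_3Gy8WZD53wWAE41lr57by3".
My goal is to print out the first message in that chat.
I had to edit this twice because I made some mistakes.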
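I suggest two changes to your code that will save you major frustration:
- Avoid explicit sleep calls; instead, wait for an element to be present. This lets your program wait only as long as the page you are trying to load actually needs.
- Use CSS selectors instead of XPath. You get finer control over which elements you access, and your code becomes more robust and flexible.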
In practice, it looks like this:
Wait up to 80 seconds for the login:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# Get the page, now the user will need to log in
driver.get('https://www.reddit.com')

# Wait until a chat message element is present, up to 80 seconds
try:
    element = WebDriverWait(driver, 80).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "pre._3Gy8WZD53wWAE41lr57by3"))
    )
except TimeoutException:
    print("You didn't log in, shutting down program")
    driver.quit()
    raise SystemExit(1)  # stop here so the code below never uses a closed driver

# continue as normal here
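To illustrate why an explicit wait usually finishes sooner than a fixed sleep, here is a minimal sketch of the polling idea in plain Python. The names wait_until and page_ready are hypothetical and made up for this example; this is not Selenium's implementation, just an approximation of the behavior under the assumption that the condition is cheap to check repeatedly.

```python
import time


def wait_until(condition, timeout=80.0, poll_interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Hypothetical helper for illustration only; it is not Selenium's code.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # stop as soon as the condition is met
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Usage: the condition becomes truthy on the third check, so the wait ends
# after roughly two poll intervals instead of the full timeout.
calls = {"count": 0}


def page_ready():
    calls["count"] += 1
    return "loaded" if calls["count"] >= 3 else None


status = wait_until(page_ready, timeout=5.0, poll_interval=0.01)
print(status)
```

The timeout is only an upper bound here: a fast login returns almost immediately, while a fixed sleep always costs the full duration.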
Use CSS selectors to find the messages:
from selenium.webdriver.common.by import By

# I personally like to always use the plural form of this function
# since, if nothing matches, it returns an empty list. The singular form
# raises an error if no results are found
# NOTE: this uses reddit's class for messages, you may need to change the css selector
all_messages = driver.find_elements(By.CSS_SELECTOR, 'pre._3Gy8WZD53wWAE41lr57by3')

# You can now access the first and last elements from this
# (assumes at least one message was found):
first_message = all_messages[0].text
last_message = all_messages[-1].text

# Alternatively, if you are concerned about memory usage from potentially large
# lists of messages, use the css pseudo-classes ':first-of-type' and ':last-of-type'
# NOTE: these match relative to sibling elements under the same parent, so they
# may match more than one element depending on the page markup
# NOTE: taking index 0 only when the list is non-empty gives None
# if no results are found
first_message = driver.find_elements(By.CSS_SELECTOR, 'pre._3Gy8WZD53wWAE41lr57by3:first-of-type')
first_message = first_message[0] if first_message else None
last_message = driver.find_elements(By.CSS_SELECTOR, 'pre._3Gy8WZD53wWAE41lr57by3:last-of-type')
last_message = last_message[0] if last_message else None
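If you find yourself repeating the empty-list check, it could be wrapped in a small helper. The sketch below is only an illustration: first_and_last_text and FakeElement are hypothetical names, and FakeElement merely stands in for a Selenium element by exposing a .text attribute so the example runs without a browser.

```python
def first_and_last_text(elements):
    """Return (first_text, last_text) from a list of objects with a `.text` attribute.

    Returns (None, None) for an empty list. Hypothetical helper for illustration.
    """
    if not elements:
        return None, None
    return elements[0].text, elements[-1].text


class FakeElement:
    """Stand-in for a Selenium element; it only mimics the `.text` attribute."""

    def __init__(self, text):
        self.text = text


messages = [FakeElement("Hello"), FakeElement("How are you?"), FakeElement("Sleep")]
first, last = first_and_last_text(messages)
print(first)
print(last)
print(first_and_last_text([]))
```

With real results you would pass in the list returned by the plural find function instead of the fake elements.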
I hope this provides an immediate solution, as well as some basic principles for optimizing web scraping.