I am trying to extract some YouTube comments and have tried several approaches.
My code:
from selenium import webdriver
import pandas as pd
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import time
DRIVER_PATH = <your chromedriver path>
wd = webdriver.Chrome(executable_path=DRIVER_PATH)
url = 'https://www.youtube.com/watch?v=5qzKTbnhyhc'
wd.get(url)
wait = WebDriverWait(wd, 100)
time.sleep(40)
v_title = wd.find_element_by_xpath('//*[@id="container"]/h1/yt-formatted-string').text
print("title Is ")
print(v_title)
comments_xpath = '//h2[@id="count"]/yt-formatted-string/span[1]'
v_comm_cnt = wait.until(EC.visibility_of_element_located((By.XPATH, comments_xpath)))
#wd.find_element_by_xpath(comments_xpath)
print(len(v_comm_cnt))
I get the following error:
selenium.common.exceptions.TimeoutException: Message:
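I get the correct value for the title, but not for the comment count. Can someone tell me what is wrong with my code?
Note that when I search for the expression in the element inspector, the comment count XPath //h2[@id="count"]/yt-formatted-string/span[1] points to the right place.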
Updated answer
Well, this one was tricky.
There are several issues here:
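- Some bad JavaScript on this page makes the Selenium WebDriver driver.get() method keep waiting for the page load until it times out, even though the page looks loaded. To get around this, I used the eager page load strategy.
- The page contains several blocks of markup for the same areas, and sometimes one of them is used (visible) and sometimes the other. This makes element locators hard to use. So here I wait for the title element from one of the blocks to become visible. If it is visible, I take the text from there; otherwise I wait for the second element to become visible (it appears immediately) and take the text from it.
- There are several ways to scroll the page, and not all of them work here. I found one that works and does not scroll too far.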
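The fallback between alternative blocks described in the second point could also be factored into a small helper. This is only a sketch, not part of the answer's code: the name first_visible_text is hypothetical, it polls by hand instead of using WebDriverWait, it raises Python's built-in TimeoutError rather than Selenium's TimeoutException, and it does not handle elements that go stale between lookup and check. It only assumes the driver exposes find_elements(by, value) and that elements expose is_displayed() and text, as Selenium's do.

```python
import time


def first_visible_text(driver, locators, timeout=10.0, poll=0.5):
    """Return the text of the first displayed element matched by any locator.

    `locators` is a list of (by, value) pairs, e.g. ("xpath", "//h1").
    Raises TimeoutError if nothing becomes visible within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        for by, value in locators:
            # find_elements returns an empty list instead of raising
            for element in driver.find_elements(by, value):
                if element.is_displayed():
                    return element.text
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no visible element for any of {locators!r}")
        time.sleep(poll)
```

With a real driver this would be called as first_visible_text(driver, [(By.XPATH, title_xpath), (By.XPATH, alternative_title)]), so whichever block is rendered first wins and neither locator has to time out before the other is tried.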
The code below works 100%; I ran it several times:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

options = Options()
options.add_argument("--start-maximized")
# "eager" lets driver.get() return once the DOM is ready instead of waiting for every resource
options.page_load_strategy = "eager"
# Path to the local chromedriver binary
s = Service(r'C:\webdrivers\chromedriver.exe')
driver = webdriver.Chrome(options=options, service=s)
url = 'https://www.youtube.com/watch?v=5qzKTbnhyhc'
driver.get(url)
driver.maximize_window()
wait = WebDriverWait(driver, 10)
# Two alternative locators: the page renders the title in one of two blocks
title_xpath = "//div[@class='style-scope ytd-video-primary-info-renderer']/h1"
alternative_title = "//*[@id='title']/h1"
v_title = ""
try:
    v_title = wait.until(EC.visibility_of_element_located((By.XPATH, title_xpath))).text
except Exception:
    # First block is not visible: fall back to the second one
    v_title = wait.until(EC.visibility_of_element_located((By.XPATH, alternative_title))).text
print("Title is " + v_title)
comments_xpath = "//div[@id='title']//*[@id='count']//span[1]"
# Scroll down by 600 px so the comments section gets rendered
driver.execute_script("window.scrollBy(0, arguments[0]);", 600)
try:
    v_comm_cnt = wait.until(EC.visibility_of_element_located((By.XPATH, comments_xpath)))
except Exception:
    # The visibility wait may time out; the element is still read directly below
    pass
v_comm_cnt = driver.find_element(By.XPATH, comments_xpath).text
print("Video has " + v_comm_cnt + " comments")
The output is:
Title is Music for when you are stressed 🍀 Chil lofi | Music to Relax, Drive, Study, Chill
Video has 834 comments
Process finished with exit code 0
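Note that v_comm_cnt above is a string. If a number is needed, something like the following might do. It is a sketch under the assumption that the count is rendered as plain digits, possibly with grouping separators such as "1,234"; abbreviated forms like "1.2K" are not handled, and the name parse_count is hypothetical.

```python
import re


def parse_count(text):
    """Convert a count string such as "834" or "1,234" to an int."""
    digits = re.sub(r"\D", "", text)  # drop everything that is not a digit
    if not digits:
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)


print(parse_count("834"))
```

Raising on input without digits, instead of returning 0, keeps an empty or not yet loaded counter from being mistaken for a video with no comments.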