Selenium crashes when I try to parse the next page (and the seven pages after it) of a website. Is there a way to fix this?



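I want to parse the IMDb movie ratings located here, about 8 pages in total. To do this I am using Selenium, and I am having trouble with the click that should move the algorithm on to the next page. In the end, when I continue with BeautifulSoup, I need 1000 titles. The code below does not work. I need to use the "Next" button in this HTML: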

<a class="flat-button lister-page-next next-page" href="/list/ls000004717/?page=2">
Next
</a>

Here is the code:

from selenium import webdriver as wb
browser = wb.Chrome()
browser.get('https://www.imdb.com/list/ls000004717/')
field = browser.find_element_by_name("flat-button lister-page-next next-page").click()

The error is as follows:

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":".flat-button lister-page-next next-page"}
(Session info: chrome=78.0.3904.108)

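I suppose I lack the necessary syntax knowledge, or I may have mixed things up a bit. I tried searching SO, but every example is quite specific, and I don't have the knowledge to fully extrapolate from those cases. Is there a way to handle this with Selenium?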

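You can try an XPath query on the Next text inside the button. You should probably also use WebDriverWait, since you are navigating across multiple pages, and then scroll the button into view, since it sits at the bottom of the page: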

from selenium import webdriver as wb
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from time import sleep

browser = wb.Chrome()
browser.get('https://www.imdb.com/list/ls000004717/')
# keep clicking next until we reach the end
for i in range(0, 9):
    # wait up to 10s before locating next button
    try:
        next_button = WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[contains(@class, 'page') and contains(text(), 'Next')]")))
        # scroll down to button using Javascript
        browser.execute_script("arguments[0].scrollIntoView(true);", next_button)
        # click the button
        # next_button.click() throws an exception here -- replaced with a JS click
        browser.execute_script("arguments[0].click();", next_button)
        # I never recommend using sleep like this, but WebDriverWait is not waiting on next button to fully load, so it goes stale.
        sleep(5)
    # case: next button no longer exists, we have reached the end
    except TimeoutException:
        break

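I also wrapped everything in a try/except TimeoutException block to handle the case where we have reached the last page and the Next button no longer exists, which breaks out of the loop. This works for me across multiple pages.

I also had to add an explicit sleep(5), because even after calling WebDriverWait with element_to_be_clickable, next_button still threw StaleElementReferenceException. It seems WebDriverWait finishes before the page has fully loaded, so the state of next_button changes after it has been located. Adding sleep(5) is normally bad practice, but I did not find another workaround here. If anyone else has a suggestion, feel free to comment on or edit the answer.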
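One possible alternative to the fixed sleep(5) is to poll for a visible sign that the page has actually changed. The sketch below is library-independent and only illustrates the idea: `click_and_wait_for_change`, `click`, and `read_marker` are hypothetical names, not part of Selenium. It assumes there is some value on the page that differs from one page to the next.

```python
import time


def click_and_wait_for_change(click, read_marker, timeout=10.0, poll=0.25, ignored=()):
    """Call click(), then poll read_marker() until its value changes.

    Returns the new marker value, or raises TimeoutError if the value
    did not change within `timeout` seconds. Exceptions listed in
    `ignored` are treated as "not changed yet" while polling.
    """
    before = read_marker()
    click()
    deadline = time.monotonic() + timeout
    while True:
        try:
            current = read_marker()
        except ignored:
            current = before
        if current != before:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError("marker did not change after click")
        time.sleep(poll)
```

With Selenium, `click` could wrap the JavaScript click used above, and `read_marker` could re-locate an element on every call and return its text, for example `lambda: browser.find_element(By.CSS_SELECTOR, "span.pagination-range").text`. That selector is taken from another answer in this thread, and whether its text changes on every page is an assumption. Selenium also provides EC.staleness_of(element), which waits until the previously located element has been detached from the DOM.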

There are a few approaches that should work. 1. Use a selector for the Next button and loop until the end:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
browser = webdriver.Chrome()
browser.get('https://www.imdb.com/list/ls000004717/')
selector = 'a[class*="next-page"]'
num_pages = 10
for page in range(num_pages):
    # Wait for the element to load
    WebDriverWait(browser, 10).until(ec.presence_of_element_located((By.CSS_SELECTOR, selector)))
    # ... Do rating parsing here
    browser.find_element_by_css_selector(selector).click()

Another option is not to click the element at all, but to navigate to the next page with browser.get('...'):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
# Set up browser as before and navigate to the page
browser = webdriver.Chrome()
browser.get('https://www.imdb.com/list/ls000004717/')
selector = 'a[class*="next-page"]'
base_url = 'https://www.imdb.com/list/ls000004717/'
page_extension = '?page='
num_pages = 10
# Already at page = 1, so only needs to loop 9 times
for page in range(2, num_pages + 1):
    # Wait for the page to load
    WebDriverWait(browser, 10).until(ec.presence_of_element_located((By.CSS_SELECTOR, selector)))
    # ... Do rating parsing here
    next_page = base_url + page_extension + str(page)
    browser.get(next_page)
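Instead of hard-coding the number of pages, the URL of the next page could also be read from the Next link itself. The following is a minimal sketch using only the standard library; `NextLinkParser` and `next_page_url` are hypothetical helper names, and it assumes the link looks like the `<a class="... next-page" href="...">` snippet from the question.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Collects the href of the first <a> whose class list contains 'next-page'."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.next_href is not None:
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "next-page" in classes and attr_map.get("href"):
            self.next_href = attr_map["href"]


def next_page_url(current_url, html):
    """Return the absolute URL of the next page, or None if no Next link is found."""
    parser = NextLinkParser()
    parser.feed(html)
    if parser.next_href is None:
        return None
    # The href in the question is site-relative, so resolve it against the current URL.
    return urljoin(current_url, parser.next_href)
```

In a Selenium loop, `html` would be browser.page_source and `current_url` would be browser.current_url, and the loop could stop once the function returns None. Whether the last page really omits the link (rather than, say, disabling it) is an assumption that would need checking against the live page.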

Note: field = browser.find_element_by_name("...").click() does not assign a WebElement to field, because the click() method has no return value.

You can try using a partial CSS selector.

browser.find_element_by_css_selector("a[class*='next-page']").click()
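For context on why the original lookup fails: the selector in the error message, `.flat-button lister-page-next next-page`, contains spaces, and in CSS a space is a descendant combinator, so it searches for `<next-page>` tags nested inside `<lister-page-next>` tags rather than for one element carrying three classes. As an alternative to the partial match above, all classes can be chained with dots. The helper below is a hypothetical illustration of that conversion, not a Selenium function.

```python
def compound_class_selector(tag, class_attr):
    """Build a CSS selector that requires every class in class_attr on one element."""
    classes = class_attr.split()
    if not classes:
        raise ValueError("class_attr contains no class names")
    # "a" + ".flat-button" + ".lister-page-next" + ... with no spaces in between
    return tag + "".join("." + name for name in classes)


selector = compound_class_selector("a", "flat-button lister-page-next next-page")
print(selector)
```

The resulting string can be passed to a CSS-selector lookup in the same way as the partial selector above.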

To click the element with the text NEXT until you reach the 901 - 1,000 of 1,000 page, you have to:

  • scrollIntoView() the element once visibility_of_element_located() is satisfied
  • Induce WebDriverWait for element_to_be_clickable()
  • You can use the following solution:

    • Code block:

      from selenium import webdriver
      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support import expected_conditions as EC
      from selenium.common.exceptions import TimeoutException
      options = webdriver.ChromeOptions()
      options.add_argument("start-maximized")
      options.add_experimental_option("excludeSwitches", ["enable-automation"])
      options.add_experimental_option('useAutomationExtension', False)
      driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
      driver.get('https://www.imdb.com/list/ls000004717/')
      driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "span.pagination-range"))))
      while True:
          try:
              WebDriverWait(driver, 20).until(EC.invisibility_of_element((By.CSS_SELECTOR, "div.row.text-center.lister-working.hidden")))
              driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "span.pagination-range"))))
              WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.flat-button.lister-page-next.next-page"))).click()
              print("Clicked on NEXT button")
          except TimeoutException as e:
              print("No more NEXT button")
              break
      driver.quit()
      
    • Console output:

      Clicked on NEXT button
      Clicked on NEXT button
      Clicked on NEXT button
      Clicked on NEXT button
      Clicked on NEXT button
      Clicked on NEXT button
      Clicked on NEXT button
      Clicked on NEXT button
      Clicked on NEXT button
      No more NEXT button
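Rather than relying on a 20-second timeout to detect the last page, the pagination text itself could be inspected. The sketch below assumes that span.pagination-range contains text of the form "901 - 1,000 of 1,000", as quoted above; `parse_pagination_range` and `is_last_page` are hypothetical helper names, and the "1 - 100 of 1,000" string is an invented example of what a first page might show.

```python
import re

# Matches e.g. "901 - 1,000 of 1,000": three numbers with optional thousands separators.
_RANGE_RE = re.compile(r"(\d[\d,]*)\s*-\s*(\d[\d,]*)\s+of\s+(\d[\d,]*)")


def parse_pagination_range(text):
    """Return (first, last, total) as integers from a pagination range string."""
    match = _RANGE_RE.search(text)
    if match is None:
        raise ValueError("unrecognised pagination text: %r" % text)
    first, last, total = (int(group.replace(",", "")) for group in match.groups())
    return first, last, total


def is_last_page(text):
    """True when the last item shown on the page is also the last item of the list."""
    _, last, total = parse_pagination_range(text)
    return last >= total


print(is_last_page("1 - 100 of 1,000"))      # assumed first-page text
print(is_last_page("901 - 1,000 of 1,000"))  # text quoted in the answer
```

In the loop above, the text would come from the element located with "span.pagination-range", and the loop could break as soon as is_last_page() returns True instead of waiting for the TimeoutException.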
      
