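I want to parse the IMDb movie ratings located here, about 8 pages in total. I am using Selenium for this, and I am having trouble with the click that moves the algorithm on to the next page. In the end I need 1000 titles, which I will then process with BeautifulSoup. The code below does not work; I need to use the "Next" button from this HTML: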
<a class="flat-button lister-page-next next-page" href="/list/ls000004717/?page=2">
Next
</a>
Here is the code:
from selenium import webdriver as wb
browser = wb.Chrome()
browser.get('https://www.imdb.com/list/ls000004717/')
field = browser.find_element_by_name("flat-button lister-page-next next-page").click()
The error is as follows:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":".flat-button lister-page-next next-page"}
(Session info: chrome=78.0.3904.108)
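I think I lack the necessary knowledge of the syntax, or I may have mixed it up a bit. I tried searching SO, but every example is quite specific and I do not know enough to generalize from those cases. Is there a way to handle this with Selenium?
You can try using XPath to query the Next text inside the button. You should probably also invoke WebDriverWait, since you are navigating across multiple pages, and then scroll the button into view, since it is at the bottom of the page: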
from selenium import webdriver as wb
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from time import sleep

browser = wb.Chrome()
browser.get('https://www.imdb.com/list/ls000004717/')

# keep clicking next until we reach the end
for i in range(0, 9):
    # wait up to 10s before locating next button
    try:
        next_button = WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[contains(@class, 'page') and contains(text(), 'Next')]")))

        # scroll down to button using Javascript
        browser.execute_script("arguments[0].scrollIntoView(true);", next_button)

        # click the button
        # next_button.click() throws an exception here, so use a JS click instead
        browser.execute_script("arguments[0].click();", next_button)

        # I never recommend using sleep like this, but WebDriverWait is not waiting on next button to fully load, so it goes stale.
        sleep(5)

    # case: next button no longer exists, we have reached the end
    except TimeoutException:
        break
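I also wrapped everything in a try / except TimeoutException block to handle the case where we have reached the last page and the Next button no longer exists, which breaks out of the loop. This worked for me across multiple pages.
I also had to add an explicit sleep(5), because even after calling WebDriverWait on element_to_be_clickable, next_button still threw a StaleElementReferenceException. It seems WebDriverWait was finishing before the page had fully loaded, so the state of next_button changed after it was located. Adding sleep(5) is normally bad practice, but there did not seem to be another workaround here. If anyone else has a suggestion on this, feel free to comment or edit the answer.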
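One possible alternative to the fixed sleep, offered only as a sketch: click the button and then poll until the browser reports a different URL. It assumes that clicking Next changes browser.current_url (the href="/list/ls000004717/?page=2" in the question suggests this, but it has not been verified against the live page). The helper name click_and_wait_for_url_change and its timeout and poll parameters are made up for illustration; current_url and execute_script are the only driver members it touches.

```python
import time


def click_and_wait_for_url_change(browser, button, timeout=10.0, poll=0.25):
    """Click `button` via JavaScript, then poll until browser.current_url changes.

    Returns True if the URL changed within `timeout` seconds, False otherwise.
    Assumption: the Next link triggers a navigation that changes the URL.
    """
    old_url = browser.current_url
    browser.execute_script("arguments[0].click();", button)

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if browser.current_url != old_url:
            return True
        time.sleep(poll)
    return False
```

A changed URL does not by itself prove that the new page has finished rendering, so the Next button should still be located again on every iteration rather than reusing the reference from the previous page.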
There are a few ways this could work: 1. Use a selector for the next button and loop until the end:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec

browser = webdriver.Chrome()
browser.get('https://www.imdb.com/list/ls000004717/')

selector = 'a[class*="next-page"]'
num_pages = 10

for page in range(num_pages):
    # Wait for the element to load
    WebDriverWait(browser, 10).until(ec.presence_of_element_located((By.CSS_SELECTOR, selector)))

    # ... Do rating parsing here

    browser.find_element(By.CSS_SELECTOR, selector).click()
2. Another option is not to click the element at all, but to navigate to the next page with browser.get('...'):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec

# Set up browser as before and navigate to the page
browser = webdriver.Chrome()
browser.get('https://www.imdb.com/list/ls000004717/')

selector = 'a[class*="next-page"]'
base_url = 'https://www.imdb.com/list/ls000004717/'
page_extension = '?page='
num_pages = 10

# Already at page = 1, so only needs to loop 9 times
for page in range(2, num_pages + 1):
    # Wait for the page to load
    WebDriverWait(browser, 10).until(ec.presence_of_element_located((By.CSS_SELECTOR, selector)))

    # ... Do rating parsing here

    next_page = base_url + page_extension + str(page)
    browser.get(next_page)
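The page URLs for this second option can also be built ahead of time. The sketch below assumes 100 titles per page, which is inferred from the "901 - 1,000 of 1,000" range text mentioned elsewhere in this thread, and the ?page=N pattern from the Next link's href. The function name list_page_urls is hypothetical.

```python
from math import ceil
from urllib.parse import urlencode


def list_page_urls(base_url, total_items, items_per_page):
    """Return one URL per list page, using the ?page=N pattern of the Next link."""
    num_pages = ceil(total_items / items_per_page)
    urls = []
    for page in range(1, num_pages + 1):
        if page == 1:
            # Page 1 is the bare list URL.
            urls.append(base_url)
        else:
            urls.append(base_url + "?" + urlencode({"page": page}))
    return urls


# Assumed figures: 1000 titles, 100 per page.
urls = list_page_urls("https://www.imdb.com/list/ls000004717/", 1000, 100)
print(len(urls))
print(urls[1])
```

Each URL in the list can then be passed to browser.get() in turn, so no Next button has to be located at all.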
Note: field = browser.find_element_by_name("...").click() does not assign a WebElement to field, because the click() method has no return value.
You can try using a partial CSS selector.
browser.find_element_by_css_selector("a[class*='next-page']").click()
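For context, the class attribute in the question holds three space-separated class names, so it cannot be passed as one class name. A CSS selector either matches a substring of the attribute, as above, or chains every class with dots. The small helper below is a hypothetical illustration of the second form and is not part of Selenium.

```python
def css_selector_for(tag, class_attr):
    """Build a CSS selector that requires every class in a space-separated class attribute."""
    return tag + "".join("." + name for name in class_attr.split())


print(css_selector_for("a", "flat-button lister-page-next next-page"))
# a.flat-button.lister-page-next.next-page
```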
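To click the element with the text NEXT until you reach the 901 - 1,000 of 1,000 page, you have to:
- scrollIntoView() the element once visibility_of_element_located() is satisfied.
- Induce WebDriverWait for element_to_be_clickable().
- You can use the following solution:
- Code block: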
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:WebDriverschromedriver.exe')
driver.get('https://www.imdb.com/list/ls000004717/')
driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "span.pagination-range"))))
while True:
    try:
        WebDriverWait(driver, 20).until(EC.invisibility_of_element((By.CSS_SELECTOR, "div.row.text-center.lister-working.hidden")))
        driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "span.pagination-range"))))
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.flat-button.lister-page-next.next-page"))).click()
        print("Clicked on NEXT button")
    except TimeoutException as e:
        print("No more NEXT button")
        break
driver.quit()
- Console output:
Clicked on NEXT button
Clicked on NEXT button
Clicked on NEXT button
Clicked on NEXT button
Clicked on NEXT button
Clicked on NEXT button
Clicked on NEXT button
Clicked on NEXT button
Clicked on NEXT button
No more NEXT button
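Instead of waiting for a timeout to learn that there is no NEXT button, the range text itself could be used to detect the last page. This is a sketch under the assumption that the text of span.pagination-range looks like "901 - 1,000 of 1,000", as quoted above. The function names are hypothetical.

```python
import re

# Matches text such as "901 - 1,000 of 1,000"; commas are thousands separators.
RANGE_PATTERN = re.compile(r"([\d,]+)\s*-\s*([\d,]+)\s+of\s+([\d,]+)")


def parse_pagination_range(text):
    """Return (first, last, total) as integers from a pagination range string."""
    match = RANGE_PATTERN.search(text)
    if match is None:
        raise ValueError("unrecognised pagination text: %r" % text)
    first, last, total = (int(group.replace(",", "")) for group in match.groups())
    return first, last, total


def is_last_page(text):
    """True when the last item shown on this page is the last item overall."""
    _, last, total = parse_pagination_range(text)
    return last >= total


print(parse_pagination_range("901 - 1,000 of 1,000"))
print(is_last_page("1 - 100 of 1,000"))
```

In the loop above, the string would come from the .text of the located span.pagination-range element, and the loop could break as soon as is_last_page returns True.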