Selenium scrolls randomly



I am trying to scrape data from tables spread across a few hundred pages of a website. Here is part of what I have so far:

driver.get("link")
driver.maximize_window()
window_before = driver.window_handles[0]
driver.switch_to.window(window_before)
wait = WebDriverWait(driver, 10)
driver.execute_script("window.scrollTo(0, 350)")
games = driver.find_elements(By.XPATH, '//*[@id="schedule"]/tbody/tr')

This code only works some of the time. If I run this block 10 times, the site scrolls down only 5 times. I tried using this instead:

for i in range(0, 2): driver.find_element(By.XPATH, '//*[@id="meta"]/div[1]/p[1]/a').send_keys(Keys.DOWN)

But the same problem occurs. Sometimes it scrolls down the amount I need, sometimes it does nothing, and sometimes it scrolls the whole page.

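This part of my code navigates to the first link I need to click. On the next page I need to scroll again, and the same problem occurs there. All of this is part of a loop that goes through a few hundred pages to read HTML tables, so even if it works the first 50 times, I cannot get all the data I need.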

Edit: Directly after the snippet above, I have this:

for idx, game in enumerate(games):
    driver.find_element(By.XPATH, '/html/body/div[2]/div[6]/div[3]/div[2]/table/tbody/tr['+str(idx+1)+']/td[6]/a').click()

This is where I get the "Element is not clickable at point (X, Y)" error.

Am I doing something wrong here, or is there a workaround to achieve my goal?

Here is one way to get the href attribute of each "Box Score" link from that page (based on the OP's clarification in the comments):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument("window-size=1280,720")
webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
wait = WebDriverWait(browser, 20)
actions = ActionChains(browser)
url = 'https://www.basketball-reference.com/leagues/NBA_2014_games-october.html'
browser.get(url)
# print(browser.page_source)
# browser.maximize_window()
try:
    wait.until(EC.element_to_be_clickable((By.XPATH, '//div[@class="qc-cmp2-summary-section"]'))).click()
    print('clicked cookie parent')
    wait.until(EC.element_to_be_clickable((By.XPATH, '//button[@mode="primary"]'))).click()
    print('accepted cookies')
except Exception as e:
    print('no cookies')
wait.until(EC.element_to_be_clickable((By.XPATH, '//div[@id="all_schedule"]'))).location_once_scrolled_into_view
table_with_score_links = wait.until(EC.presence_of_element_located((By.XPATH, '//table[@id="schedule"]')))
# print(table_with_score_links.get_attribute('outerHTML'))
links_from_table = [x.get_attribute('href') for x in table_with_score_links.find_elements(By.TAG_NAME, 'a') if x.text == 'Box Score']
print(links_from_table)

Result printed in the terminal:

clicked cookie parent
accepted cookies
['https://www.basketball-reference.com/boxscores/201310290IND.html', 'https://www.basketball-reference.com/boxscores/201310290MIA.html', 'https://www.basketball-reference.com/boxscores/201310290LAL.html', 'https://www.basketball-reference.com/boxscores/201310300CLE.html', 'https://www.basketball-reference.com/boxscores/201310300TOR.html', 'https://www.basketball-reference.com/boxscores/201310300PHI.html', 'https://www.basketball-reference.com/boxscores/201310300DET.html', 'https://www.basketball-reference.com/boxscores/201310300NYK.html', 'https://www.basketball-reference.com/boxscores/201310300NOP.html', 'https://www.basketball-reference.com/boxscores/201310300MIN.html', 'https://www.basketball-reference.com/boxscores/201310300HOU.html', 'https://www.basketball-reference.com/boxscores/201310300SAS.html', 'https://www.basketball-reference.com/boxscores/201310300DAL.html', 'https://www.basketball-reference.com/boxscores/201310300UTA.html', 'https://www.basketball-reference.com/boxscores/201310300PHO.html', 'https://www.basketball-reference.com/boxscores/201310300SAC.html', 'https://www.basketball-reference.com/boxscores/201310300GSW.html', 'https://www.basketball-reference.com/boxscores/201310310CHI.html', 'https://www.basketball-reference.com/boxscores/201310310LAC.html']

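I tried to make the variable names as descriptive as possible, and I left some commented-out lines of code in place to show the thought process that leads to the end goal.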

You can now go through these links one by one, load each page, and so on.
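As a rough sketch of that last step, the loop below visits each collected URL directly, so no scrolling or clicking is involved. The function name `collect_pages` and the `pause_seconds` delay are hypothetical, introduced only for illustration. The function assumes nothing about the browser object beyond a `get()` method and a `page_source` attribute, both of which the Selenium WebDriver provides.

```python
import time


def collect_pages(browser, links, pause_seconds=1.0):
    """Visit each link in turn and return a dict mapping URL -> page HTML.

    `browser` is expected to behave like a Selenium WebDriver: it needs a
    get(url) method and a page_source attribute.
    """
    pages = {}
    for link in links:
        # Navigate straight to the URL instead of clicking the link,
        # which sidesteps the "not clickable" problem entirely.
        browser.get(link)
        pages[link] = browser.page_source
        # Hypothetical pause between requests (an assumption, tune as needed).
        time.sleep(pause_seconds)
    return pages
```

With the code above, this could be called as `pages = collect_pages(browser, links_from_table)`, after which each stored HTML string can be parsed for its tables. If a page builds its tables with JavaScript, an explicit wait for the table element before reading `page_source` would probably be needed as well.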

The Selenium documentation can be found here: https://www.selenium.dev/documentation/
