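I am trying to use Python and Selenium to scrape dynamically loaded data from a website. The problem is that only about half of the data is reported as present, when in fact all of it should be there. Pausing before printing the page content, or simply searching for elements by class, does not seem to help. The URL of the site is https://www.sportsbookreview.com/betting-odds/nfl-football/consensus/?date=20180909. As you can see, there are 13 main sections, but I can only retrieve data from the first four games. To show the problem, here is the code that prints the text of the whole page (documentElement.innerText), which makes the difference between the loaded and the unloaded data visible.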
from selenium import webdriver
import requests
url = "https://www.sportsbookreview.com/betting-odds/nfl-football/consensus/?date=20180909"
driver = webdriver.Chrome()
driver.get(url)
print(driver.execute_script("return document.documentElement.innerText;"))
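Edit: The problem is not the wait time, since I am running this line by line and waiting for the page to load completely. The issue seems to come down to Selenium not capturing all of the JS-loaded text on the page, as the console output in the answer below shows.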
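@sudonym's analysis was in the right direction. You need to induce WebDriverWait for the desired elements to become visible, and then extract them through the execute_script() method, as follows:
- Code Block: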
# -*- coding: UTF-8 -*-
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

url = "https://www.sportsbookreview.com/betting-odds/nfl-football/consensus/?date=20180909"
driver = webdriver.Chrome()
driver.get(url)
WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//h2[contains(.,'USA - National Football League')]//following::section//span[3]")))
print(driver.execute_script("return document.documentElement.innerText;"))
- Console Output:
SPORTSBOOK REVIEW Home Best Sportsbooks Rating Guide Blacklist Bonuses BETTING ODDS FREE PICKS Sports Picks NFL College Football NBA NCAAB MLB NHL More Sports How to Bet Tools FORUM Home Players Talk Sportsbooks & Industry Newbie Forum Handicapper Think Tank David Malinsky's Point Blank Service Plays Bitcoin Sports Betting NBA Betting NFL Betting NCAAF Betting MLB Betting NHL Betting CONTESTS EARN BETPOINTS What Are Betpoints? SBR Sportsbook SBR Casino SBR Racebook SBR Poker SBR Store Today NFL NBA NHL MLB College Football NCAA Basketball Soccer Soccer Odds Major League Soccer UEFA Champions League UEFA Nations League UEFA Europa League English Premier League World Cup 2022 Tennis Tennis Odds ATP WTA UFC Boxing More Sports CFL WNBA AFL Betting Odds/NFL Odds/Consensus TODAY | YESTERDAY | DATE ? Login ? Settings ? Bet Tracker ? Bet Card ? Favorites NFL Consensus for Sep 09, 2018 USA - National Football League Sunday Sep 09, 2018 01:00 PM / Pittsburgh vs Cleveland 453 Pittsburgh 454 Cleveland Current Line -3½+105 +3½-115 Wagers Placed 10040 54.07% 8530 45.93% Amount Wagered $381,520.00 56.10% $298,550.00 43.90% Average Bet Size $38.00 $35.00 SBR Contest Best Bets 22 9 01:00 PM / San Francisco vs Minnesota 455 San Francisco 456 Minnesota Current Line +6-102 -6-108 Wagers Placed 6250 41.25% 8900 58.75% Amount Wagered $175,000.00 29.50% $418,300.00 70.50% Average Bet Size $28.00 $47.00 SBR Contest Best Bets 5 19 01:00 PM / Cincinnati vs Indianapolis 457 Cincinnati 458 Indianapolis Current Line -1-104 +1-106 Wagers Placed 11640 66.36% 5900 33.64% Amount Wagered $1,338,600.00 85.65% $224,200.00 14.35% Average Bet Size $115.00 $38.00 SBR Contest Best Bets 23 12 01:00 PM / Buffalo vs Baltimore 459 Buffalo 460 Baltimore Current Line +7½-103 -7½-107 Wagers Placed 5220 33.83% 10210 66.17% Amount Wagered $78,300.00 16.79% $387,980.00 83.21% Average Bet Size $15.00 $38.00 SBR Contest Best Bets 5 17 01:00 PM / Jacksonville vs N.Y. Giants 461 Jacksonville 462 N.Y. 
Giants 01:00 PM / Tampa Bay vs New Orleans 463 Tampa Bay 464 New Orleans 01:00 PM / Houston vs New England 465 Houston 466 New England 01:00 PM / Tennessee vs Miami 467 Tennessee 468 Miami 04:05 PM / Kansas City vs L.A. Chargers 469 Kansas City 470 L.A. Chargers 04:25 PM / Seattle vs Denver 471 Seattle 472 Denver 04:25 PM / Dallas vs Carolina 473 Dallas 474 Carolina 04:25 PM / Washington vs Arizona 475 Washington 476 Arizona 08:20 PM / Chicago vs Green Bay 477 Chicago 478 Green Bay Media Site Map Terms of use Contact Us Privacy Policy DMCA 18+. Gamble Responsibly. © Sportsbook Review. All Rights Reserved.
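To make the gap in the console output above easier to see, the page text can be split into one block per game and each block checked for a data label. The following is only a rough sketch: the helper name `games_with_data`, the kickoff-time pattern, and the choice of "Current Line" as the marker are assumptions taken from the output above, not from the site's markup.

```python
import re

# Assumption: every game block starts with a kickoff line such as
# "01:00 PM / Pittsburgh vs Cleveland", followed by a 3-digit rotation number.
GAME = re.compile(r"(\d{2}:\d{2} [AP]M / .+? vs .+?)\s+\d{3}\s")


def games_with_data(page_text, marker="Current Line"):
    """Return a list of (game title, has_data) pairs found in page_text."""
    matches = list(GAME.finditer(page_text))
    results = []
    for i, match in enumerate(matches):
        # A block runs from this game's title to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(page_text)
        block = page_text[match.start():end]
        results.append((match.group(1), marker in block))
    return results


# Shortened excerpt of the console output above, used as sample input.
sample = (
    "01:00 PM / Pittsburgh vs Cleveland 453 Pittsburgh 454 Cleveland "
    "Current Line -3½+105 +3½-115 Wagers Placed 10040 54.07% 8530 45.93% "
    "01:00 PM / Jacksonville vs N.Y. Giants 461 Jacksonville 462 N.Y. Giants "
    "01:00 PM / Tampa Bay vs New Orleans 463 Tampa Bay 464 New Orleans"
)

for title, has_data in games_with_data(sample):
    print(title, "->", "data" if has_data else "no data")
```

Running the same helper on the string returned by execute_script() would list which of the 13 games came back without odds, which is quicker than reading the raw dump.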
This solution is only worth considering if you have many WebDriverWait calls and care about reduced runtime. Otherwise, go for DebanjanB's approach.
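You need to wait some time to let the HTML load completely. You can also set a timeout for script execution. To add an unconditional wait to driver.get(URL) in Selenium, use driver.set_page_load_timeout(n), with n = time in seconds, and a loop: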
import logging
import random
from time import sleep

# WebDriverException is the base class of Selenium's timeout errors
from selenium.common.exceptions import WebDriverException

logger = logging.getLogger(__name__)
# driver, URL and n (timeout in seconds) are assumed to be defined beforehand

driver.set_page_load_timeout(n)            # Set timeout of n seconds for page load
loading_finished = 0                       # Set flag to 0
while loading_finished == 0:               # Repeat while flag = 0
    try:
        sleep(random.uniform(0.1, 0.5))    # wait some time
        driver.get(URL)                    # try to load for n seconds
        loading_finished = 1               # Set flag to 1 and exit while loop
        logger.info("website loaded")      # Indicate load success
    except WebDriverException:
        logger.warning("timeout - retry")  # Indicate load fail
else:                                      # If flag == 1
    driver.set_script_timeout(n)           # Set timeout of n seconds for script
    script_finished = 0                    # Set flag to 0
    while script_finished == 0:            # Second loop
        try:
            print(driver.execute_script("return document.documentElement.innerText;"))
            script_finished = 1            # Set flag to 1
            logger.info("script done")     # Indicate script done
        except WebDriverException:
            logger.warning("script timeout")
    else:
        logger.info("if you're still missing html here, increase timeout")
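Note that the loops above keep retrying for as long as the call fails. One possible variation is to cap the number of attempts with a small helper. This is a sketch rather than part of the answer: `retry`, `max_attempts`, `min_wait` and `max_wait` are hypothetical names, and the helper is plain Python with no Selenium dependency.

```python
import logging
import random
import time

logger = logging.getLogger(__name__)


def retry(action, max_attempts=5, min_wait=0.1, max_wait=0.5,
          exceptions=(Exception,)):
    """Call action() until it succeeds or max_attempts is used up."""
    for attempt in range(1, max_attempts + 1):
        # Short random pause before each attempt, as in the loop above.
        time.sleep(random.uniform(min_wait, max_wait))
        try:
            return action()
        except exceptions as exc:
            logger.warning("attempt %d/%d failed: %s",
                           attempt, max_attempts, exc)
    raise RuntimeError("gave up after %d attempts" % max_attempts)


# Possible use with Selenium (driver and URL assumed to exist):
#   retry(lambda: driver.get(URL), exceptions=(WebDriverException,))
#   text = retry(lambda: driver.execute_script(
#       "return document.documentElement.innerText;"),
#       exceptions=(WebDriverException,))
```

Capping the attempts means a page that never loads ends in a clear error instead of an endless loop.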