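I'm trying to scrape per-at-bat data from a number of individual URLs; here is one example (https://baseballsavant.mlb.com/savant-player/willson-contreras-575929?stats=gamelogs-r-hitting-statcast&season=2020).
The site seems to hide the data, or at least I can't retrieve it with the following: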
driver = webdriver.Chrome('/Users/gru/Documents/chromedriver')
driver.get('https://baseballsavant.mlb.com/savant-player/willson-contreras-575929?stats=gamelogs-r-hitting-statcast&season=2020')
html_page = driver.page_source
time.sleep(15)
soup = BeautifulSoup(html_page, 'lxml')
for j in soup.find_all('tr'):
    drounders = []
    for h in j.find_all('td'):
        drounders.append(h.get_text())
    print(drounders)
Here are the first few rows I expect:
   Game Date   Bat Team  Fld Team  Pitcher           Result     EV (MPH)  LA (°)  Dist (ft)  Direction     Pitch (MPH)  Pitch Type
1  2020-08-12                      Carrasco, Carlos  strikeout
2  2020-08-12                      Carrasco, Carlos  strikeout
3  2020-08-12                      Carrasco, Carlos  force_out                               Opposite
4  2020-08-11                      Allen, Logan      force_out  107.8     -25     5          Pull          94.0         4-Seam Fastball
5  2020-08-11                      Allen, Logan      strikeout                                             77.3         Curveball
6  2020-08-11                      Hill, Cam         sac_fly    100.5     42      345        Straightaway  91.6         4-Seam Fastball
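The only problem I see here is the Bat Team column, because that column contains images rather than text. In this answer I scrape the link of the image in the Bat Team column and append it at the end of each row. If you would rather ignore it, remove img from the for loop.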
Code:
from selenium import webdriver
from bs4 import BeautifulSoup
import time

site = 'https://baseballsavant.mlb.com/savant-player/willson-contreras-575929?stats=gamelogs-r-hitting-statcast&season=2020'
finalData = []

# Here I am using Chrome's web driver.
# Note: recent Selenium 4 releases no longer accept executable_path; there the path goes through a Service object.
driver = webdriver.Chrome(executable_path='chromedriver.exe')
# For the Firefox web driver:
# driver = webdriver.Firefox(executable_path='geckodriver.exe')

driver.get(site)
time.sleep(10)  # give the page time to render the table before reading the source
soup = BeautifulSoup(driver.page_source, 'html.parser')

# The game log table sits inside the div with id "gamelogs_statcast"
table = soup.find("div", id="gamelogs_statcast")
trs = table.find_all("tr")
for trValue in trs:
    txt = str(trValue.text)           # all cell text of the row, concatenated
    img = str(trValue.find("img"))    # first <img> tag in the row; 'None' for the header row
    data = txt + img
    finalData.append(data)
print(finalData)
Output:
['Game DateBat TeamFld TeamPitcherResultEV (MPH)LA (°)Dist (ft)DirectionPitch (MPH)Pitch TypeNone', '1 2020-08-13 Burnes, Corbin field_out 104.1 24 400 Straightaway 95.7 4-Seam Fastball <img class="table-team-logo" src="https://www.mlbstatic.com/team-logos/112.svg"/>', '2 2020-08-13 Burnes, Corbin walk 89.2 Slider <img class="table-team-logo" src="https://www.mlbstatic.com/team-logos/112.svg"/>', '3 2020-08-13 Anderson, Brett hit_by_pitch 89.5 4-Seam Fastball <img class="table-team-logo" src="https://www.mlbstatic.com/team-logos/112.svg"/>' ........]
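The strings above glue all the cells of a row together, which makes it awkward to split the columns again later. Here is a minimal sketch of one alternative, assuming each row keeps its values in th/td cells and that the logo cells contain only an img tag with a src attribute, as the output above suggests. The sample_html snippet and the rows_to_cells helper are made up for illustration; with the real page you would pass driver.page_source instead.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for driver.page_source.
sample_html = """
<div id="gamelogs_statcast">
  <table>
    <tr><th>Game Date</th><th>Bat Team</th><th>Pitcher</th><th>Result</th></tr>
    <tr>
      <td>2020-08-13</td>
      <td><img class="table-team-logo" src="https://www.mlbstatic.com/team-logos/112.svg"/></td>
      <td>Burnes, Corbin</td>
      <td>walk</td>
    </tr>
  </table>
</div>
"""


def rows_to_cells(html, container_id="gamelogs_statcast"):
    """Return one list of cell values per table row inside the given div."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id=container_id)
    if container is None:
        return []  # the div is missing, e.g. the page had not rendered yet
    rows = []
    for tr in container.find_all("tr"):
        cells = []
        for cell in tr.find_all(["th", "td"]):
            text = cell.get_text(strip=True)
            img = cell.find("img")
            # A logo cell has no text, so fall back to the image URL.
            if not text and img is not None:
                text = img.get("src", "")
            cells.append(text)
        if cells:
            rows.append(cells)
    return rows


for row in rows_to_cells(sample_html):
    print(row)
```

Each row comes back as a list with one entry per column, so empty cells stay in place as empty strings and the columns remain aligned with the header row.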
Hope this helps; let me know if you need anything else.