Scraping obfuscated data from a website with Python



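I am trying to scrape individual at-bat data from various URLs; here is an example (https://baseballsavant.mlb.com/savant-player/willson-contreras-575929?stats=gamelogs-r-hitting-statcast&season=2020).

The page seems to hide the data, or at least I cannot get at it using the following code: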

driver = webdriver.Chrome('/Users/gru/Documents/chromedriver')
driver.get('https://baseballsavant.mlb.com/savant-player/willson-contreras-575929?stats=gamelogs-r-hitting-statcast&season=2020')
html_page = driver.page_source
time.sleep(15)
soup = BeautifulSoup(html_page, 'lxml')
for j in soup.find_all('tr'):
    drounders = []
    for h in j.find_all('td'):
        drounders.append(h.get_text())
    print(drounders)

Here are the first few rows I expect:

Game Date   Bat Team    Fld Team    Pitcher Result  EV (MPH)    LA (°)  Dist (ft)   Direction   Pitch (MPH) Pitch Type  
1   2020-08-12          Carrasco, Carlos    strikeout                           
2   2020-08-12          Carrasco, Carlos    strikeout                           
3   2020-08-12          Carrasco, Carlos    force_out               Opposite            
4   2020-08-11          Allen, Logan    force_out   107.8   -25 5   Pull    94.0    4-Seam Fastball 
5   2020-08-11          Allen, Logan    strikeout                   77.3    Curveball   
6   2020-08-11          Hill, Cam   sac_fly 100.5   42  345 Straightaway    91.6    4-Seam Fastball

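The only problem I see here is the Bat Team column, because it contains an image rather than text. In this answer I scrape the image link from the Bat Team column and append it at the end of each row; if you do not need it, remove img from the for loop.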
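Since the Bat Team cell only carries a logo, the image link is the one usable value in it. The logo links on this page look like https://www.mlbstatic.com/team-logos/112.svg, so one option is to keep just the file name as a team key. The sketch below assumes that the file name identifies the team, which the page does not state; `logo_id` is a hypothetical helper name, not part of Selenium or BeautifulSoup.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def logo_id(img_src):
    """Return the file name (without extension) of a logo URL.

    Assumption: the file name is a team identifier. The page does not say so.
    """
    path = urlparse(img_src).path       # drops scheme, host and any query string
    return PurePosixPath(path).stem     # '/team-logos/112.svg' -> '112'


src = "https://www.mlbstatic.com/team-logos/112.svg"
team_id = logo_id(src)
print(team_id)
```

The `src` value would come from the `img` tag of each row, for example `trValue.find("img")["src"]` when the row actually contains an image.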

Code:

from selenium import webdriver
from bs4 import BeautifulSoup
import time

site = 'https://baseballsavant.mlb.com/savant-player/willson-contreras-575929?stats=gamelogs-r-hitting-statcast&season=2020'
finalData = []
# Here I am using Chrome's web driver.
# Note: newer Selenium 4 releases take a Service object instead of executable_path.
driver = webdriver.Chrome(executable_path='chromedriver.exe')
# For the Firefox web driver:
# driver = webdriver.Firefox(executable_path='geckodriver.exe')
driver.get(site)
time.sleep(10)  # give the page's JavaScript time to render the table
soup = BeautifulSoup(driver.page_source, 'html.parser')
table = soup.find("div", id="gamelogs_statcast")
trs = table.find_all("tr")
for trValue in trs:
    txt = str(trValue.text)         # text of every cell in the row, concatenated
    img = str(trValue.find("img"))  # first <img> tag in the row, or 'None' if there is none
    data = txt + img
    finalData.append(data)
print(finalData)

Output:

['Game DateBat TeamFld TeamPitcherResultEV (MPH)LA (°)Dist (ft)DirectionPitch (MPH)Pitch TypeNone', '1 2020-08-13   Burnes, Corbin field_out 104.1 24 400 Straightaway 95.7 4-Seam Fastball <img class="table-team-logo" src="https://www.mlbstatic.com/team-logos/112.svg"/>', '2 2020-08-13   Burnes, Corbin walk     89.2 Slider <img class="table-team-logo" src="https://www.mlbstatic.com/team-logos/112.svg"/>', '3 2020-08-13   Anderson, Brett hit_by_pitch     89.5 4-Seam Fastball <img class="table-team-logo" src="https://www.mlbstatic.com/team-logos/112.svg"/>' ........]
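Each entry above is one string per row, with all cell text run together. To get one value per column, as in the expected table, the rows can be walked cell by cell. The following is only a sketch: it uses the standard-library `HTMLParser` instead of BeautifulSoup so that it runs without extra packages, `TableRows` is a hypothetical class name, and `sample_html` is a hand-written stand-in for `driver.page_source` that imitates the markup seen in the output, not the real page.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect each <tr> as a list of cell strings (text, or img src for logo cells)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
        elif tag == "img" and self._cell is not None:
            # A logo cell has no text, so keep the image link instead.
            self._cell.append(dict(attrs).get("src") or "")

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical stand-in for driver.page_source (not the real page markup).
sample_html = """
<table>
  <tr><th>Game Date</th><th>Bat Team</th><th>Pitcher</th><th>Result</th></tr>
  <tr><td>2020-08-13</td>
      <td><img class="table-team-logo" src="https://www.mlbstatic.com/team-logos/112.svg"/></td>
      <td>Burnes, Corbin</td><td>field_out</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample_html)
rows = parser.rows
for row in rows:
    print(row)
```

With the real page, `sample_html` would be replaced by `driver.page_source`. Empty cells stay as empty strings, so every row keeps the same number of columns.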

Hope this helps. Let me know if you need anything else.