Scraping the second table on a website



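I'm trying to scrape the second table on this page (Year-by-Year Team Batting per Game), but I can only get the first one (Year-by-Year Team Batting). I've looked into several different ways of scraping it with BeautifulSoup, but none of them worked. Below is the code for the two approaches I tried. Any help or ideas would be greatly appreciated!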

#1

import requests
import pandas as pd
from bs4 import BeautifulSoup, Comment

bat_stats_url = "https://www.baseball-reference.com/teams/PHI/batteam.shtml"
data_b = requests.get(bat_stats_url)
soup = BeautifulSoup(data_b.text)
bat_stats_table = soup.select('table.stats_table')[0]

bat_year_stats = pd.read_html(data_b.text, match = 'Year-by-Year Team Batting')
bat_year_stats[0]

#2

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
url = 'https://www.baseball-reference.com/teams/PHI/batteam.shtml'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36', 'Referer': 'https://www.nseindia.com/'}
r = requests.get(url,  headers=headers)
soup = bs(r.content,'lxml')
table =soup.select('table')[-1]
rows = table.find_all('tr')
output = []
for row in rows:
    cols = row.find_all('td')
    cols = [item.text.strip() for item in cols]
    output.append([item for item in cols if item])

bat_year_stats[0].columns.values.tolist()
df = pd.DataFrame(output, columns = ['Year','Lg','W','L','Finish','R/G','G','PA','AB','R','H',
'2B','3B','HR','RBI','SB','CS','BB','SO','BA','OBP','SLG','OPS','E','DP','Fld%'])
df = df.iloc[1:]
df

The table is stored inside an HTML comment, so pandas.read_html() cannot find it until you extract it from the comment first:

soup.find_all(string=lambda text: isinstance(text, Comment))
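
As a quick check of that claim, here is a minimal sketch on a tiny inline page rather than the real site. The HTML and the ids `visible` and `hidden` are made up for illustration. It shows that a table wrapped in a comment is not part of the parsed element tree, but is still reachable as a `Comment` string:

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical miniature page: one normal table, one wrapped in an HTML comment.
html = """
<div><table id="visible"><tr><td>1</td></tr></table></div>
<div><!--
<table id="hidden"><tr><td>2</td></tr></table>
--></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Only the visible table exists as an element, so select()/find_all() miss the other one.
visible_ids = [t.get("id") for t in soup.find_all("table")]

# The commented-out markup is kept as Comment strings, which can be searched as text.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
hidden = [c for c in comments if 'id="hidden"' in c]

print(visible_ids)
print(len(hidden))
```

This is why `soup.select('table')` and a plain `pd.read_html()` call only ever return the first table.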

Then use the result to read your table:

pd.read_html([x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="yby_team_bat_per_game"' in x][0])[0]
Example
import requests
import pandas as pd
from bs4 import BeautifulSoup, Comment
soup = BeautifulSoup(requests.get('https://www.baseball-reference.com/teams/PHI/batteam.shtml').text)
pd.read_html([x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="yby_team_bat_per_game"' in x][0])[0]
Output
A DataFrame holding the Year-by-Year Team Batting per Game table, with columns such as PA, AB, R, BB, SO, BA, OBP, SLG and OPS.
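
If the one-liner is hard to read, the same idea can be sketched as a small helper. `read_commented_table` is a hypothetical name, not part of pandas or BeautifulSoup, and the sample HTML and numbers below are made up so the sketch runs offline. It assumes a parser backend for `pd.read_html` (lxml, or html5lib with bs4) is installed. Newer pandas versions warn when given a literal HTML string, so the comment text is wrapped in `StringIO`:

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup, Comment


def read_commented_table(html, table_id):
    """Return the table with the given id that sits inside an HTML comment.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    marker = f'id="{table_id}"'
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        if marker in comment:
            # Parse only the comment's markup and pick the table by its id.
            return pd.read_html(StringIO(str(comment)), attrs={"id": table_id})[0]
    raise ValueError(f"no commented-out table with id {table_id!r}")


# Made-up stand-in for the real page: the second table is commented out.
sample = """
<table id="first">
  <thead><tr><th>Year</th><th>R</th></tr></thead>
  <tbody><tr><td>2000</td><td>100</td></tr></tbody>
</table>
<!--
<table id="second">
  <thead><tr><th>Year</th><th>R</th></tr></thead>
  <tbody><tr><td>2000</td><td>1.5</td></tr></tbody>
</table>
-->
"""

df = read_commented_table(sample, "second")
print(df)
```

Against the real page, the call would presumably be `read_commented_table(requests.get(url).text, "yby_team_bat_per_game")`, using the id from the answer above.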
