Parsing sports-reference data with BeautifulSoup



I'm new to Python. For this project I want to build a loop that parses the ranking data for every NFL team from https://www.pro-football-reference.com/teams/. I first created a DataFrame to act as a lookup table, like this:

my_array = np.array([['crd','Arizona_Cardinals'],['atl','Atalanta_Falcons'],['rav','Baltimore_Ravens'],['buf','Buffalo_Bills'],
['car','Carolina_Panthers'],['chi','Chicago_Bears'],['cin','Cincinnati_Bengals'],['cle','Cleveland_Browns'],
['dal','Dalls_Cowboys'],['den','Denver_Broncos'],['det','Detroit_Lions'],['gnb','Green_Bay_Packers'],['htx','Houston_Texans'],
['clt','Indianapolis_Colts'],['jax','Jacksonville_Jaguars'],['kan','Kansas_City_Chiefs'],['rai','Las_Vegas_Raiders'],
['sgd','Los_Angeles_Chargers'],['ram','Los_Angeles_Rams'],['mia','Miami_Dolphins'],['min','Minnesota_Vikings'],
['nwe','New_England_Patriots'],['nor','New_Orleans_Saints'],['nyg','New_York_Giants'],['nyj','New_York_Jets'],
['phi','Philidophia_Eagles'],['pt','Pittsburgh_Steelers'],['sfo','San_Francisco_49ers'],['sea','Seattle_Seahawks'],
['tam','Tampa_Bay_Buccaneers'],['oti','Tennessee_Titans'],['was','Washington_Football_Team']])
team_list = pd.DataFrame(my_array, columns=['code','teams'])

Here is the loop I used to parse all 32 pages:

url_base = 'https://www.pro-football-reference.com/teams/'
url_list = [url_base+str(i) for i in team_list['code']]
for url in url_list:
    page = requests.get(url).text
    soup = bs(page)
    for table in soup.find_all('table'):
        headers = []
        for i in table.find_all('th', scope="col"):
            title = i.text.strip()
            headers.append(title)
        table_data = []
        for tr in table.find_all("tr"):
            t_row = {}
            for td, th in zip(tr.find_all("td"), headers):
                t_row[th] = td.text.replace('\n', '').strip()
            table_data.append(t_row)

However, the result is an empty list. What is wrong with my code? Thanks!

Here is the logic without using pandas' .read_html():

import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
my_array = np.array([['crd','Arizona_Cardinals'],['atl','Atlanta_Falcons'],['rav','Baltimore_Ravens'],['buf','Buffalo_Bills'],
['car','Carolina_Panthers'],['chi','Chicago_Bears'],['cin','Cincinnati_Bengals'],['cle','Cleveland_Browns'],
['dal','Dallas_Cowboys'],['den','Denver_Broncos'],['det','Detroit_Lions'],['gnb','Green_Bay_Packers'],['htx','Houston_Texans'],
['clt','Indianapolis_Colts'],['jax','Jacksonville_Jaguars'],['kan','Kansas_City_Chiefs'],['rai','Las_Vegas_Raiders'],
['sdg','Los_Angeles_Chargers'],['ram','Los_Angeles_Rams'],['mia','Miami_Dolphins'],['min','Minnesota_Vikings'],
['nwe','New_England_Patriots'],['nor','New_Orleans_Saints'],['nyg','New_York_Giants'],['nyj','New_York_Jets'],
['phi','Philadelphia_Eagles'],['pit','Pittsburgh_Steelers'],['sfo','San_Francisco_49ers'],['sea','Seattle_Seahawks'],
['tam','Tampa_Bay_Buccaneers'],['oti','Tennessee_Titans'],['was','Washington_Football_Team']])

url_base = 'https://www.pro-football-reference.com/teams/'
url_list = [(url_base+str(i[0]), i[1]) for i in my_array]
rows = []
for url, team in url_list:
    print('Gathering: %s' % team)
    response = requests.get(url)

    soup = BeautifulSoup(response.text, 'html.parser')

    table = soup.find('table', {'id': 'team_index'})
    headers = [x.text.strip() for x in table.find_all('tr')[1].find_all('th')]

    trs = table.find_all('tr')[2:]

    for tr in trs:
        year = tr.find('th').text.strip()
        if year == 'Year' or year == '':
            continue
        data = [year] + [x.text.strip() for x in tr.find_all('td')]

        rows.append(data)

final_table = pd.DataFrame(rows, columns=headers)

You just need to fix the indentation and define table_data outside the loop:

url_base = 'https://www.pro-football-reference.com/teams/'
url_list = [url_base+str(i) for i in team_list['code']]
table_data = []
for url in url_list:
    page = requests.get(url).text
    soup = bs(page)
    for table in soup.find_all('table'):
        headers = []
        for i in table.find_all('th', scope="col"):
            title = i.text.strip()
            headers.append(title)
        for tr in table.find_all("tr"):
            t_row = {}
            for td, th in zip(tr.find_all("td"), headers):
                t_row[th] = td.text.replace('\n', '').strip()
            table_data.append(t_row)

The code has several problems:

First (as already mentioned), your indentation is off, so that needs fixing: the tables have to be parsed inside the first loop, after the soup object has been created. Second, zip stops at the shorter of its inputs, so cells get silently dropped whenever a row has a different number of tds than there are headers. Third, when you build your dictionary from the zip, you want the headers as the keys, not the values. Fourth, the header row is a multi-index (it spans two rows), so the titles don't line up exactly with the tds when you zip them. Fifth, you need to initialize table_data before the loop, otherwise it just overwrites itself on each iteration.
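As a quick illustration of the zip pitfall (using hypothetical header/cell lists, not data scraped from the site):

```python
# zip() stops at the shorter input, so a mismatched header/cell
# count silently drops data instead of raising an error.
headers = ['Year', 'Lg', 'Tm', 'W', 'L']
cells = ['2020', 'NFL', 'Arizona Cardinals']  # hypothetical row with fewer cells

row = {th: td for th, td in zip(headers, cells)}
print(row)  # {'Year': '2020', 'Lg': 'NFL', 'Tm': 'Arizona Cardinals'}
# 'W' and 'L' are simply missing -- no exception is raised
```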

Finally, consider using pandas' .read_html(). It uses BeautifulSoup under the hood and parses the tables for you, so you only need to do minimal work to clean them up.

I also fixed a few mistakes in your array (you could scrape those hrefs and team names from the table at https://www.pro-football-reference.com/teams/ instead, but hardcoding them the way you did should work fine, and those links won't change any time soon, if ever):

  1. 'Atalanta_Falcons'->'Atlanta_Falcons'
  2. 'Dalls_Cowboys'->'Dallas_Cowboys'
  3. 'Philidophia_Eagles'->'Philadelphia_Eagles'
  4. 'sgd'->'sdg'
  5. 'pt'->'pit'
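If you'd rather not hardcode the list at all, here is one way the codes and names could be pulled from the page itself. This is a sketch: the parsing is factored into a function so it can be shown on a static HTML snippet that mimics the page's structure, and the exact markup of the live table is an assumption, so verify the selectors against the real page.

```python
import re
from bs4 import BeautifulSoup

def extract_team_links(html):
    """Return (code, name) pairs from anchors of the form /teams/xxx/."""
    soup = BeautifulSoup(html, 'html.parser')
    pairs = []
    for a in soup.select('th a[href^="/teams/"]'):
        m = re.match(r'^/teams/([a-z]{3})/$', a['href'])
        if m:
            pairs.append((m.group(1), a.text.strip().replace(' ', '_')))
    return pairs

# Static snippet mimicking the structure of the teams index table:
sample = '''
<table id="teams_active">
  <tr><th><a href="/teams/crd/">Arizona Cardinals</a></th></tr>
  <tr><th><a href="/teams/atl/">Atlanta Falcons</a></th></tr>
</table>
'''
print(extract_team_links(sample))
# [('crd', 'Arizona_Cardinals'), ('atl', 'Atlanta_Falcons')]
```

Against the live page you would pass requests.get('https://www.pro-football-reference.com/teams/').text instead of the snippet.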

Code:

import pandas as pd
import numpy as np
my_array = np.array([['crd','Arizona_Cardinals'],['atl','Atlanta_Falcons'],['rav','Baltimore_Ravens'],['buf','Buffalo_Bills'],
['car','Carolina_Panthers'],['chi','Chicago_Bears'],['cin','Cincinnati_Bengals'],['cle','Cleveland_Browns'],
['dal','Dallas_Cowboys'],['den','Denver_Broncos'],['det','Detroit_Lions'],['gnb','Green_Bay_Packers'],['htx','Houston_Texans'],
['clt','Indianapolis_Colts'],['jax','Jacksonville_Jaguars'],['kan','Kansas_City_Chiefs'],['rai','Las_Vegas_Raiders'],
['sdg','Los_Angeles_Chargers'],['ram','Los_Angeles_Rams'],['mia','Miami_Dolphins'],['min','Minnesota_Vikings'],
['nwe','New_England_Patriots'],['nor','New_Orleans_Saints'],['nyg','New_York_Giants'],['nyj','New_York_Jets'],
['phi','Philadelphia_Eagles'],['pit','Pittsburgh_Steelers'],['sfo','San_Francisco_49ers'],['sea','Seattle_Seahawks'],
['tam','Tampa_Bay_Buccaneers'],['oti','Tennessee_Titans'],['was','Washington_Football_Team']])

final_table = pd.DataFrame()
url_base = 'https://www.pro-football-reference.com/teams/'
url_list = [(url_base+str(i[0]), i[1]) for i in my_array]
for url, team in url_list:
    print('Gathering: %s' % team)

    # Gets full unfiltered table
    table = pd.read_html(url, header=1)[0]

    # Drop those sub-header rows
    table = table[table['Year'].ne('Year')]

    # Drop the null rows
    table = table.dropna(subset=['Year'])

    # Append to your final dataframe
    final_table = final_table.append(table, sort=False).reset_index(drop=True)

Output:

print(final_table)
Year   Lg                 Tm  W   L  ...    MoV   SoS    SRS  OSRS  DSRS
0     2021  NFL  Arizona Cardinals  0   0  ...    NaN   NaN    NaN   NaN   NaN
1     2020  NFL  Arizona Cardinals  8   8  ...    2.7  -0.1    2.6   1.5   1.0
2     2019  NFL  Arizona Cardinals  5  10  ...   -5.1   1.8   -3.2  -0.3  -2.9
3     2018  NFL  Arizona Cardinals  3  13  ...  -12.5   1.0  -11.5  -9.6  -1.9
4     2017  NFL  Arizona Cardinals  8   8  ...   -4.1   0.4   -3.7  -4.0   0.2
...  ...                ... ..  ..  ...    ...   ...    ...   ...   ...
2089  1936  NFL   Boston Redskins*  7   5  ...    3.3  -3.0    0.3  -1.0   1.3
2090  1935  NFL    Boston Redskins  2   8  ...   -5.3  -0.8   -6.1  -6.1   0.0
2091  1934  NFL    Boston Redskins  6   6  ...    1.1  -0.8    0.2  -1.7   2.0
2092  1933  NFL    Boston Redskins  5   5  ...    0.5   1.4    1.9  -0.8   2.7
2093  1932  NFL      Boston Braves  4   4  ...   -2.4  -1.6   -4.0  -4.0  -0.1
[2094 rows x 29 columns]
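One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas versions the loop above would fail. Collecting the per-team frames in a list and concatenating once at the end does the same job; the frames below are small hypothetical stand-ins for the tables read_html would return.

```python
import pandas as pd

# Hypothetical stand-ins for the per-team tables read_html would return
frames = [
    pd.DataFrame({'Year': ['2020', '2019'], 'Tm': ['Arizona Cardinals'] * 2}),
    pd.DataFrame({'Year': ['2020', '2019'], 'Tm': ['Atlanta Falcons'] * 2}),
]

# A single concat at the end is also faster than appending inside the loop,
# since each append copies the whole accumulated frame.
final_table = pd.concat(frames, sort=False).reset_index(drop=True)
print(final_table)
```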
