I'm new to Python. In this project, I want to build a loop that parses the rankings data for all NFL teams from https://www.pro-football-reference.com/teams/. I first created a DataFrame to serve as a directory, like so:
my_array = np.array([['crd','Arizona_Cardinals'],['atl','Atalanta_Falcons'],['rav','Baltimore_Ravens'],['buf','Buffalo_Bills'],
['car','Carolina_Panthers'],['chi','Chicago_Bears'],['cin','Cincinnati_Bengals'],['cle','Cleveland_Browns'],
['dal','Dalls_Cowboys'],['den','Denver_Broncos'],['det','Detroit_Lions'],['gnb','Green_Bay_Packers'],['htx','Houston_Texans'],
['clt','Indianapolis_Colts'],['jax','Jacksonville_Jaguars'],['kan','Kansas_City_Chiefs'],['rai','Las_Vegas_Raiders'],
['sgd','Los_Angeles_Chargers'],['ram','Los_Angeles_Rams'],['mia','Miami_Dolphins'],['min','Minnesota_Vikings'],
['nwe','New_England_Patriots'],['nor','New_Orleans_Saints'],['nyg','New_York_Giants'],['nyj','New_York_Jets'],
['phi','Philidophia_Eagles'],['pt','Pittsburgh_Steelers'],['sfo','San_Francisco_49ers'],['sea','Seattle_Seahawks'],
['tam','Tampa_Bay_Buccaneers'],['oti','Tennessee_Titans'],['was','Washington_Football_Team']])
team_list = pd.DataFrame(my_array, columns=['code','teams'])
Here is the loop I use to parse all 32 pages:
url_base = 'https://www.pro-football-reference.com/teams/'
url_list = [url_base+str(i) for i in team_list['code']]
for url in url_list:
    page = requests.get(url).text
    soup = bs(page)
    for table in soup.find_all('table'):
        headers = []
        for i in table.find_all('th', scope="col"):
            title = i.text.strip()
            headers.append(title)
        table_data = []
        for tr in table.find_all("tr"):
            t_row = {}
            for td, th in zip(tr.find_all("td"), headers):
                t_row[th] = td.text.replace('\n', '').strip()
            table_data.append(t_row)
However, the result is an empty list. What is wrong with my code? Thanks!
Here is the logic without using pandas' .read_html():
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
my_array = np.array([['crd','Arizona_Cardinals'],['atl','Atlanta_Falcons'],['rav','Baltimore_Ravens'],['buf','Buffalo_Bills'],
['car','Carolina_Panthers'],['chi','Chicago_Bears'],['cin','Cincinnati_Bengals'],['cle','Cleveland_Browns'],
['dal','Dallas_Cowboys'],['den','Denver_Broncos'],['det','Detroit_Lions'],['gnb','Green_Bay_Packers'],['htx','Houston_Texans'],
['clt','Indianapolis_Colts'],['jax','Jacksonville_Jaguars'],['kan','Kansas_City_Chiefs'],['rai','Las_Vegas_Raiders'],
['sdg','Los_Angeles_Chargers'],['ram','Los_Angeles_Rams'],['mia','Miami_Dolphins'],['min','Minnesota_Vikings'],
['nwe','New_England_Patriots'],['nor','New_Orleans_Saints'],['nyg','New_York_Giants'],['nyj','New_York_Jets'],
['phi','Philadelphia_Eagles'],['pit','Pittsburgh_Steelers'],['sfo','San_Francisco_49ers'],['sea','Seattle_Seahawks'],
['tam','Tampa_Bay_Buccaneers'],['oti','Tennessee_Titans'],['was','Washington_Football_Team']])
url_base = 'https://www.pro-football-reference.com/teams/'
url_list = [(url_base+str(i[0]), i[1]) for i in my_array]
rows = []
for url, team in url_list:
    print('Gathering: %s' % team)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find('table', {'id': 'team_index'})
    headers = [x.text.strip() for x in table.find_all('tr')[1].find_all('th')]
    trs = table.find_all('tr')[2:]
    for tr in trs:
        year = tr.find('th').text.strip()
        if year == 'Year' or year == '':
            continue
        data = [year] + [x.text.strip() for x in tr.find_all('td')]
        rows.append(data)

final_table = pd.DataFrame(rows, columns=headers)
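You can check the same row-building pattern offline on a small static HTML snippet. This is just a sketch: the table below is made up for illustration (it mimics the caption row, header row, and repeated 'Year' sub-header rows of the real team_index table, but is not the site's actual markup):

```python
import pandas as pd
from bs4 import BeautifulSoup

# Tiny stand-in for the real page: caption row, header row, data rows,
# and a repeated 'Year' sub-header row like the ones skipped above.
html = """
<table id="team_index">
  <tr><th colspan="3">Team Index</th></tr>
  <tr><th>Year</th><th>W</th><th>L</th></tr>
  <tr><th>2020</th><td>8</td><td>8</td></tr>
  <tr><th>Year</th><td>W</td><td>L</td></tr>
  <tr><th>2019</th><td>5</td><td>10</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'id': 'team_index'})

# The second <tr> holds the real column headers; data starts at the third.
headers = [x.text.strip() for x in table.find_all('tr')[1].find_all('th')]
rows = []
for tr in table.find_all('tr')[2:]:
    year = tr.find('th').text.strip()
    if year == 'Year' or year == '':
        continue  # skip repeated sub-header rows
    rows.append([year] + [x.text.strip() for x in tr.find_all('td')])

df = pd.DataFrame(rows, columns=headers)
print(df)
```

Running this yields a two-row frame with columns Year/W/L, with the sub-header row filtered out, which is exactly what the loop above does per team page.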
You just need to fix the indentation and define table_data outside the loop:
url_base = 'https://www.pro-football-reference.com/teams/'
url_list = [url_base+str(i) for i in team_list['code']]
table_data = []
for url in url_list:
    page = requests.get(url).text
    soup = bs(page)
    for table in soup.find_all('table'):
        headers = []
        for i in table.find_all('th', scope="col"):
            title = i.text.strip()
            headers.append(title)
        for tr in table.find_all("tr"):
            t_row = {}
            for td, th in zip(tr.find_all("td"), headers):
                t_row[th] = td.text.replace('\n', '').strip()
            table_data.append(t_row)
Your code has several errors:

First (as already mentioned), your indentation is off, so you need to fix it; the tables should be parsed inside the first loop, right after the soup object is created. Second, zip returns an iterator here, which is consumed after a single pass, so do something like list(zip(x, y)) if you want to inspect or reuse the pairs. Third, even then, when you build your dictionary from the zip, the headers should be the keys, not the values. Fourth, the headers are multi-indexed, so they do not line up exactly when zipped against the tds. Fifth, you need to initialize table_data before the loop, otherwise it just overwrites itself on every iteration.
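The zip/dictionary point can be illustrated with a minimal sketch (the header and cell values here are made up for demonstration):

```python
headers = ['Year', 'W', 'L']
cells = ['2020', '8', '8']

# zip yields an iterator of (header, cell) pairs; it is exhausted after
# one pass, so materialize it with list() if you need to reuse it.
pairs = list(zip(headers, cells))
print(pairs)  # [('Year', '2020'), ('W', '8'), ('L', '8')]

# Headers belong on the key side of the dict, cells on the value side.
row = dict(pairs)
print(row)    # {'Year': '2020', 'W': '8', 'L': '8'}
```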
Finally, consider using pandas' .read_html(). It uses beautifulsoup under the hood and parses the tables for you, so you only have minimal cleanup work left.
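The cleanup that remains after .read_html() can be sketched on a hand-made frame (the repeated 'Year' sub-header rows and NaN separator rows below mimic what the site's tables contain; the values are made up):

```python
import pandas as pd
import numpy as np

# Stand-in for one frame as returned by pd.read_html(url, header=1)[0].
table = pd.DataFrame({
    'Year': ['2020', 'Year', '2019', np.nan],
    'W':    ['8',    'W',    '5',    np.nan],
})

# Drop the repeated sub-header rows, then the null separator rows.
table = table[table['Year'].ne('Year')]
table = table.dropna(subset=['Year'])
print(table)  # only the '2020' and '2019' data rows remain
```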
Also, I fixed a few mistakes in your array (you could also scrape those hrefs and team names from the table at https://www.pro-football-reference.com/teams/, but hard-coding them the way you did should work fine, and these links are unlikely to change soon, if ever):

'Atalanta_Falcons' -> 'Atlanta_Falcons'
'Dalls_Cowboys' -> 'Dallas_Cowboys'
'Philidophia_Eagles' -> 'Philadelphia_Eagles'
'sgd' -> 'sdg'
'pt' -> 'pit'
Code:
import pandas as pd
import numpy as np
my_array = np.array([['crd','Arizona_Cardinals'],['atl','Atlanta_Falcons'],['rav','Baltimore_Ravens'],['buf','Buffalo_Bills'],
['car','Carolina_Panthers'],['chi','Chicago_Bears'],['cin','Cincinnati_Bengals'],['cle','Cleveland_Browns'],
['dal','Dallas_Cowboys'],['den','Denver_Broncos'],['det','Detroit_Lions'],['gnb','Green_Bay_Packers'],['htx','Houston_Texans'],
['clt','Indianapolis_Colts'],['jax','Jacksonville_Jaguars'],['kan','Kansas_City_Chiefs'],['rai','Las_Vegas_Raiders'],
['sdg','Los_Angeles_Chargers'],['ram','Los_Angeles_Rams'],['mia','Miami_Dolphins'],['min','Minnesota_Vikings'],
['nwe','New_England_Patriots'],['nor','New_Orleans_Saints'],['nyg','New_York_Giants'],['nyj','New_York_Jets'],
['phi','Philadelphia_Eagles'],['pit','Pittsburgh_Steelers'],['sfo','San_Francisco_49ers'],['sea','Seattle_Seahawks'],
['tam','Tampa_Bay_Buccaneers'],['oti','Tennessee_Titans'],['was','Washington_Football_Team']])
final_table = pd.DataFrame()
url_base = 'https://www.pro-football-reference.com/teams/'
url_list = [(url_base+str(i[0]), i[1]) for i in my_array]
for url, team in url_list:
    print('Gathering: %s' % team)
    # Gets full unfiltered table
    table = pd.read_html(url, header=1)[0]
    # Drop those sub header rows
    table = table[table['Year'].ne('Year')]
    # Drop the null rows
    table = table.dropna(subset=['Year'])
    # Append to your final dataframe (DataFrame.append was removed in pandas 2.0)
    final_table = pd.concat([final_table, table], sort=False).reset_index(drop=True)
Output:
print(final_table)
Year Lg Tm W L ... MoV SoS SRS OSRS DSRS
0 2021 NFL Arizona Cardinals 0 0 ... NaN NaN NaN NaN NaN
1 2020 NFL Arizona Cardinals 8 8 ... 2.7 -0.1 2.6 1.5 1.0
2 2019 NFL Arizona Cardinals 5 10 ... -5.1 1.8 -3.2 -0.3 -2.9
3 2018 NFL Arizona Cardinals 3 13 ... -12.5 1.0 -11.5 -9.6 -1.9
4 2017 NFL Arizona Cardinals 8 8 ... -4.1 0.4 -3.7 -4.0 0.2
... ... ... .. .. ... ... ... ... ... ...
2089 1936 NFL Boston Redskins* 7 5 ... 3.3 -3.0 0.3 -1.0 1.3
2090 1935 NFL Boston Redskins 2 8 ... -5.3 -0.8 -6.1 -6.1 0.0
2091 1934 NFL Boston Redskins 6 6 ... 1.1 -0.8 0.2 -1.7 2.0
2092 1933 NFL Boston Redskins 5 5 ... 0.5 1.4 1.9 -0.8 2.7
2093 1932 NFL Boston Braves 4 4 ... -2.4 -1.6 -4.0 -4.0 -0.1
[2094 rows x 29 columns]