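OK, this is the code I'm currently drafting to pull fielding stats for every National League player. It works fine, but I'd like to know how to drop only the all-NaN rows from the dataframe without disturbing any of the other data: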
# import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
# create a url object
url = r'https://www.baseball-reference.com/leagues/NL/2022-standard-fielding.shtml'
# create list of the stats that we care about
standardFieldingStats = [
'player',
'team_ID',
'G',
'GS',
'CG',
'Inn_def',
'chances',
'PO',
'A',
'E_def',
'DP_def',
'fielding_perc',
'tz_runs_total',
'tz_runs_total_per_season',
'bis_runs_total',
'bis_runs_total_per_season',
'bis_runs_good_plays',
'range_factor_per_nine',
'range_factor_per_game',
'pos_summary'
]
# Create object page
page = requests.get(url)
# parser-lxml = Change html to Python friendly format
# Obtain page's information
soup = BeautifulSoup(page.text, 'lxml')
# grab the league's current-year fielding table; it is turned into a dataframe below
tableNLFielding = soup.find('table', id='players_players_standard_fielding_fielding')
# grab player UID
puidList = []
rows = tableNLFielding.select('tr')
for row in rows:
    playerUID = row.select_one('td[data-append-csv]')
    playerUID = playerUID.get('data-append-csv') if playerUID else None
    if playerUID is None:
        continue
    else:
        puidList.append(playerUID)
# grab players position
compList = []
for row in rows:
    thingList = []
    for stat in range(len(standardFieldingStats)):
        thing = row.find("td", attrs={"data-stat" : standardFieldingStats[stat]})
        if thing is None:
            continue
        elif row.find("td", attrs={"data-stat" : 'player'}).text == 'Team Totals':
            continue
        elif row.find("td", attrs={"data-stat" : 'player'}).text == 'Rank in 15 NL teams':
            continue
        elif row.find("td", attrs={"data-stat" : 'player'}).text == 'Rank in 15 AL teams':
            continue
        elif thing.text == '':
            continue
        elif thing.text == 'NaN':
            continue
        else:
            thingList.append(thing.text)
    compList.append(thingList)
# build a dataframe using the fielding stat names as column headers
NLFieldingDf = pd.DataFrame(data=compList, columns=standardFieldingStats)
#NLFieldingDf = NLFieldingDf.apply(lambda x: pd.Series(x.dropna().values))
#NLFieldingDf = NLFieldingDf.apply(lambda x: pd.Series(x.fillna('').values))
# make all NaNs blanks for aesthetic reasons
#NLFieldingDf = NLFieldingDf.fillna('')
#NLFieldingDf.insert(loc=0, column='pUID', value=puidList)
Here is an example of the dataframe I want to remove the NaN rows from:
player            team  pos_summary
NaN               NaN   NaN
Brandon Woodruff  NaN   P
William Woods     ATL   NaN
Kyle Wright       ATL   P
When I try, my dataframe ends up looking like this, with the data shifted out of place:
player            team  pos_summary
Brandon Woodruff  ATL   P
William Woods     ATL   P
Kyle Wright
Ideally, I want this: the all-NaN rows removed and the partially NaN rows kept as they are:
player            team  pos_summary
Brandon Woodruff        P
William Woods     ATL
Kyle Wright       ATL   P
See the end of the full code above for my attempts.
Try dropping the rows in which every value is NaN:
df.dropna(how="all")
Also, if you need to replace the remaining NaN values with "", use:
df.fillna("", inplace=True)
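As a minimal sketch of those two calls, the snippet below rebuilds the small example from the question by hand (the values are illustrative, not scraped) and shows that `how="all"` removes only the fully empty row while the partially filled rows keep their positions. Note that `dropna` returns a new dataframe unless the result is assigned back.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the example in the question (illustrative values only).
df = pd.DataFrame(
    {
        "player": [np.nan, "Brandon Woodruff", "William Woods", "Kyle Wright"],
        "team": [np.nan, np.nan, "ATL", "ATL"],
        "pos_summary": [np.nan, "P", np.nan, "P"],
    }
)

# Drop only the rows in which every column is NaN; partial rows are kept.
cleaned = df.dropna(how="all")

# Replace the remaining NaN cells with empty strings for display.
cleaned = cleaned.fillna("")

print(cleaned)
```

The surviving rows keep their original index labels; `cleaned.reset_index(drop=True)` would renumber them if that matters.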
You can do that, but your data is inaccurate. You shouldn't be getting null values for a player's position or team.
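One likely cause, offered as an assumption rather than a confirmed diagnosis, is that the loop in the question skips empty cells with `continue`, so each row list gets shorter and the later values slide into the wrong columns. The sketch below uses a hypothetical three-column miniature of the table (the inline HTML and the `stats` list are made up for illustration) and stores a placeholder for every stat so each value stays under its own header.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical miniature of the fielding table: one header row and two
# players that each have an empty cell.
html = """
<table>
  <tr><th data-stat="player">Name</th><th data-stat="team_ID">Tm</th><th data-stat="pos_summary">Pos Summary</th></tr>
  <tr><td data-stat="player">Brandon Woodruff</td><td data-stat="team_ID"></td><td data-stat="pos_summary">P</td></tr>
  <tr><td data-stat="player">William Woods</td><td data-stat="team_ID">ATL</td><td data-stat="pos_summary"></td></tr>
  <tr><td data-stat="player">Kyle Wright</td><td data-stat="team_ID">ATL</td><td data-stat="pos_summary">P</td></tr>
</table>
"""
stats = ["player", "team_ID", "pos_summary"]

soup = BeautifulSoup(html, "html.parser")
records = []
for row in soup.select("tr"):
    player_cell = row.find("td", attrs={"data-stat": "player"})
    if player_cell is None:
        # Header rows contain only <th> cells, so skip the whole row.
        continue
    record = {}
    for stat in stats:
        cell = row.find("td", attrs={"data-stat": stat})
        text = cell.get_text(strip=True) if cell is not None else ""
        # Keep a placeholder for empty cells so later values stay in their column.
        record[stat] = text if text else None
    records.append(record)

df = pd.DataFrame(records, columns=stats)
print(df)
```

Because header rows are skipped as a whole, no all-NaN rows are created in the first place, and empty cells show up as missing values in the correct column.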
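Second, if you need to parse a <table> tag (and you don't need to extract any attributes, such as href), let pandas parse the table for you. It uses BeautifulSoup under the hood.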
import pandas as pd
url = r'https://www.baseball-reference.com/leagues/NL/2022-standard-fielding.shtml'
df = pd.read_html(url)[-1]
df = df[df['Rk'].ne('Rk')]
Output:
print(df[['Name', 'Tm', 'Pos Summary']])
Name Tm Pos Summary
0 C.J. Abrams SDP SS-2B-OF
1 Ronald Acuna Jr. ATL OF
2 Willy Adames MIL SS
3 Austin Adams SDP P
4 Riley Adams WSN C-1B
.. ... ... ...
509 Miguel Yajure PIT P
510 Mike Yastrzemski SFG OF
511 Christian Yelich MIL OF
512 Juan Yepez STL OF
513 Huascar Ynoa ATL P
[495 rows x 3 columns]
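The `df['Rk'].ne('Rk')` filter in the snippet above is what removes the header rows that the page repeats inside the table body, which is also why the index runs to 513 while only 495 rows remain. A small sketch of that step on a hand-built frame (the `raw` frame and its repeated header row are assumptions standing in for the real `pd.read_html` result):

```python
import pandas as pd

# Hand-built stand-in for a parsed table with a header row repeated mid-table.
raw = pd.DataFrame(
    {
        "Rk": ["1", "2", "Rk", "3"],
        "Name": ["C.J. Abrams", "Ronald Acuna Jr.", "Name", "Willy Adames"],
        "Tm": ["SDP", "ATL", "Tm", "MIL"],
    }
)

# Keep only rows whose Rk cell is not the literal header text "Rk".
players = raw[raw["Rk"].ne("Rk")]

# Optional: renumber the rows so the index has no gaps.
players = players.reset_index(drop=True)

print(players)
```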