Python: Pandas - Delete only the all-NaN rows and shift the data up, without shifting data in partially-NaN rows



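OK, here is the code I am currently drafting to pull the fielding data for all National League players. It works fine, but I would like to know how to drop only the all-NaN rows in the dataframe without disturbing any of the other data: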

# import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
# create a url object
url = r'https://www.baseball-reference.com/leagues/NL/2022-standard-fielding.shtml'
# create list of the stats that we care about
standardFieldingStats = [
'player',
'team_ID',
'G',
'GS',
'CG',
'Inn_def',
'chances',
'PO',
'A',
'E_def',
'DP_def',
'fielding_perc',
'tz_runs_total',
'tz_runs_total_per_season',
'bis_runs_total',
'bis_runs_total_per_season',
'bis_runs_good_plays',
'range_factor_per_nine',
'range_factor_per_game',
'pos_summary'
]
# Create object page
page = requests.get(url)
# parser-lxml = Change html to Python friendly format
# Obtain page's information
soup = BeautifulSoup(page.text, 'lxml')
# grab each teams current year batting stats and turn it into a dataframe
tableNLFielding = soup.find('table', id='players_players_standard_fielding_fielding')
# grab player UID
puidList = []
rows = tableNLFielding.select('tr')
for row in rows:
    playerUID = row.select_one('td[data-append-csv]')
    playerUID = playerUID.get('data-append-csv') if playerUID else None
    if playerUID == None:
        continue
    else:
        puidList.append(playerUID)
# grab players position
compList = []
for row in rows:
    thingList = []
    for stat in range(len(standardFieldingStats)):
        thing = row.find("td", attrs={"data-stat" : standardFieldingStats[stat]})
        if thing == None:
            continue
        elif row.find("td", attrs={"data-stat" : 'player'}).text == 'Team Totals':
            continue
        elif row.find("td", attrs={"data-stat" : 'player'}).text == 'Rank in 15 NL teams':
            continue
        elif row.find("td", attrs={"data-stat" : 'player'}).text == 'Rank in 15 AL teams':
            continue
        elif thing.text == '':
            continue
        elif thing.text == 'NaN':
            continue
        else:
            thingList.append(thing.text)
    compList.append(thingList)
# insert the batting headers to a dataframe
NLFieldingDf = pd.DataFrame(data=compList, columns=standardFieldingStats)
#NLFieldingDf = NLFieldingDf.apply(lambda x: pd.Series(x.dropna().values))
#NLFieldingDf = NLFieldingDf.apply(lambda x: pd.Series(x.fillna('').values))
# make all NaNs blanks for aesthetic reasons
#NLFieldingDf = NLFieldingDf.fillna('')
#NLFieldingDf.insert(loc=0, column='pUID', value=puidList)

Here is an example of the dataframe I want to remove the NaN rows from:

player             team   pos_summary
NaN                NaN    NaN
Brandon Woodruff   NaN    P   
William Woods      ATL    NaN
Kyle Wright        ATL    P

When I try, my dataframe ends up looking like this, with the data shifted out of place:

player             team   pos_summary
Brandon Woodruff   ATL    P   
William Woods      ATL    P
Kyle Wright

Ideally, I want this: no all-NaN rows, with the partially-NaN rows left intact:

player             team   pos_summary
Brandon Woodruff          P   
William Woods      ATL    
Kyle Wright        ATL    P

See the end of the full code above for my attempts.

Try this to drop the rows in which every value is NaN:

df.dropna(how='all')

Also, if you need to replace the remaining NaN values with '', use

df.fillna('', inplace=True)
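
As a minimal sketch of those two calls, here they are applied to a small made-up dataframe that mirrors the example table in the question (the values are only the ones shown there, not real scraped data). `reset_index` is an optional extra step, added here on the assumption that a clean 0..n index is wanted afterwards:

```python
import numpy as np
import pandas as pd

# Made-up sample mirroring the question's example table.
df = pd.DataFrame(
    {
        "player": [np.nan, "Brandon Woodruff", "William Woods", "Kyle Wright"],
        "team": [np.nan, np.nan, "ATL", "ATL"],
        "pos_summary": [np.nan, "P", np.nan, "P"],
    }
)

# Drop only rows in which every column is NaN; partially filled rows stay put.
cleaned = df.dropna(how="all").reset_index(drop=True)

# Replace the remaining NaN values with empty strings for display.
cleaned = cleaned.fillna("")

print(cleaned)
```

Because `dropna(how="all")` works row by row, it never moves a value from one row into another, unlike dropping NaNs column by column.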

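You can do that; however, your data will still be inaccurate. You should not be getting null values for a player's position or team in the first place.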
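
One likely cause, judging from the scraping loop in the question, is that empty cells are skipped with `continue`, so a row's list comes out shorter than the column list and every later value slides left. The sketch below illustrates this with made-up rows; `scraped_rows`, `skip_empty`, and `keep_placeholder` are hypothetical names used only for this illustration, and the cell values are invented:

```python
import pandas as pd

stats = ["player", "team_ID", "tz_runs_total", "pos_summary"]

# Hypothetical stand-in for scraped cells: data-stat name -> cell text.
scraped_rows = [
    {"player": "William Woods", "team_ID": "ATL", "tz_runs_total": "", "pos_summary": "P"},
    {"player": "Kyle Wright", "team_ID": "ATL", "tz_runs_total": "1", "pos_summary": "P"},
]

def skip_empty(cells, stats):
    """Mimic the question's loop: empty cells are skipped entirely."""
    return [cells[s] for s in stats if cells.get(s)]

def keep_placeholder(cells, stats):
    """Emit one value per stat, using None where a cell is missing or empty."""
    return [cells.get(s) or None for s in stats]

# Skipping shortens the first row, so "P" lands under tz_runs_total.
shifted = pd.DataFrame([skip_empty(r, stats) for r in scraped_rows], columns=stats)

# A placeholder keeps every value under its own column.
aligned = pd.DataFrame([keep_placeholder(r, stats) for r in scraped_rows], columns=stats)

print(shifted)
print(aligned)
```

With a placeholder per cell, the only NaN values left are genuinely empty cells, and `dropna(how='all')` can then remove the header and spacer rows safely.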

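Secondly, if you need to parse <table> tags (and you do not need to pull out any attributes, such as href), let pandas parse the table for you. It uses BeautifulSoup under the hood.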

import pandas as pd
url = r'https://www.baseball-reference.com/leagues/NL/2022-standard-fielding.shtml'
df = pd.read_html(url)[-1]
df = df[df['Rk'].ne('Rk')]   
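
The last line of that snippet filters out rows where the `Rk` column holds the literal text `Rk`, which is presumably how header rows repeated inside the table body show up after parsing. A minimal sketch of that filter on a made-up frame (the `Rk` values are invented for illustration, and no page is downloaded):

```python
import pandas as pd

# Made-up frame imitating a parsed table whose header row repeats mid-table.
raw = pd.DataFrame(
    {
        "Rk": ["1", "2", "Rk", "3"],
        "Name": ["C.J. Abrams", "Ronald Acuna Jr.", "Name", "Willy Adames"],
        "Tm": ["SDP", "ATL", "Tm", "MIL"],
    }
)

# Keep only rows whose Rk value is not the literal header text "Rk".
players = raw[raw["Rk"].ne("Rk")]

print(players)
```

The filter keeps the original index labels, which would explain why the output below ends at index 513 while reporting 495 rows.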

Output:

print(df[['Name', 'Tm', 'Pos Summary']])
                 Name   Tm Pos Summary
0         C.J. Abrams  SDP    SS-2B-OF
1    Ronald Acuna Jr.  ATL          OF
2        Willy Adames  MIL          SS
3        Austin Adams  SDP           P
4         Riley Adams  WSN        C-1B
..                ...  ...         ...
509     Miguel Yajure  PIT           P
510  Mike Yastrzemski  SFG          OF
511  Christian Yelich  MIL          OF
512        Juan Yepez  STL          OF
513      Huascar Ynoa  ATL           P
[495 rows x 3 columns]
