I have to work with a dataset similar to this one:
   Player CurrentTeam PreviousTeam
0   James      Team A         Null
1   David      Team B         Null
2  George      Team C         Null
3   James      Team B       Team A
4   James      Team C       Team B
... ...
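In the example above, James is the same player each time; he appears 3 times in the dataset. There is no ID column or anything similar.
What happened here is that James currently plays for Team C. Before Team C he was at Team B, and before that at Team A (where his career started).
However, I only need his actual current team (i.e. Team C).
Can I get rid of all the other rows for James that refer to his past teams?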
Following the technique in this answer, we can drop any (player, current_team) combination that also exists as a (player, previous_team) combination.
>>> import pandas as pd
>>> df
   Player CurrentTeam PreviousTeam
0   James      Team A         Null
1   David      Team B         Null
2  George      Team C         Null
3   James      Team B       Team A
4   James      Team C       Team B
>>> old_teams = df[["Player", "PreviousTeam"]].rename(columns={"PreviousTeam": "Team"})
>>> all_teams = df[["Player", "CurrentTeam"]].rename(columns={"CurrentTeam": "Team"})
>>> current_teams = (pd.merge(all_teams, old_teams, indicator=True, how="outer")
...                  .query("_merge == 'left_only'")
...                  .drop("_merge", axis=1)
...                  .reset_index(drop=True))
>>> current_teams
   Player    Team
0   David  Team B
1  George  Team C
2   James  Team C
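For anyone who wants to run this outside the REPL, here is a minimal self-contained sketch of the same anti-join idea. The function name `current_teams_by_anti_join` is made up for illustration, and the sample frame assumes `Null` is stored as a plain string. It uses `how="left"` rather than `how="outer"`, which is a small variation on the session above, not the original code.

```python
import pandas as pd

# Sample data copied from the question; "Null" is assumed to be a plain string.
df = pd.DataFrame({
    "Player": ["James", "David", "George", "James", "James"],
    "CurrentTeam": ["Team A", "Team B", "Team C", "Team B", "Team C"],
    "PreviousTeam": ["Null", "Null", "Null", "Team A", "Team B"],
})


def current_teams_by_anti_join(frame):
    """Keep (Player, Team) pairs that never appear as a (Player, PreviousTeam) pair."""
    old = frame[["Player", "PreviousTeam"]].rename(columns={"PreviousTeam": "Team"})
    new = frame[["Player", "CurrentTeam"]].rename(columns={"CurrentTeam": "Team"})
    # Left merge with an indicator: rows marked "left_only" have no match in `old`.
    merged = pd.merge(new, old.drop_duplicates(), on=["Player", "Team"],
                      how="left", indicator=True)
    return (merged[merged["_merge"] == "left_only"]
            .drop(columns="_merge")
            .reset_index(drop=True))


result = current_teams_by_anti_join(df)
print(result)
```

Because it matches on pairs rather than on position, this does not depend on the order of the rows. It would, however, also drop a player's current team if he had ever returned to a club he left earlier, since that pair then shows up on the "previous" side as well.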
Use DataFrame.drop_duplicates:
new_df = df.drop_duplicates('Player', keep='last')
print(new_df)
If you want to check PreviousTeam as well, use:
df.loc[~df.groupby('Player')['PreviousTeam']
          .shift(-1)
          .eq(df['CurrentTeam'])]
Output
   Player CurrentTeam PreviousTeam
1   David      Team B         Null
2  George      Team C         Null
4   James      Team C       Team B
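To make the PreviousTeam check easier to follow, here is the same expression split into named steps. The intermediate names `next_previous` and `superseded` are only for illustration, and the frame is rebuilt from the question's sample with `Null` assumed to be a string.

```python
import pandas as pd

# Sample data copied from the question; "Null" is assumed to be a plain string.
df = pd.DataFrame({
    "Player": ["James", "David", "George", "James", "James"],
    "CurrentTeam": ["Team A", "Team B", "Team C", "Team B", "Team C"],
    "PreviousTeam": ["Null", "Null", "Null", "Team A", "Team B"],
})

# For each row, fetch the PreviousTeam from the same player's next row
# (NaN when the player has no later row).
next_previous = df.groupby("Player")["PreviousTeam"].shift(-1)

# A row is superseded when the player's next row names this row's
# CurrentTeam as its PreviousTeam.
superseded = next_previous.eq(df["CurrentTeam"])

# Keep only the rows that no later row has superseded.
latest = df.loc[~superseded]
print(latest)
```

Note that `shift(-1)` looks at the next row within each player's group, so this still assumes each player's rows are listed in chronological order.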
For this case, one possible solution is to use groupby together with tail:
df = df.groupby('Player').tail(1)
print(df)
Output:
   Player CurrentTeam PreviousTeam
1   David      Team B         Null
2  George      Team C         Null
4   James      Team C       Team B
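One caveat worth spelling out: `tail(1)` (like `drop_duplicates(..., keep='last')`) simply keeps each player's last row, so it relies on the rows already being in chronological order. The sketch below rebuilds the question's sample (with `Null` assumed to be a string) and then reorders it on purpose; the `reordered` frame is a made-up illustration, not data from the question.

```python
import pandas as pd

# Sample data copied from the question; "Null" is assumed to be a plain string.
df = pd.DataFrame({
    "Player": ["James", "David", "George", "James", "James"],
    "CurrentTeam": ["Team A", "Team B", "Team C", "Team B", "Team C"],
    "PreviousTeam": ["Null", "Null", "Null", "Team A", "Team B"],
})

# Both calls keep the last row seen for each player.
by_tail = df.groupby("Player").tail(1)
by_dedup = df.drop_duplicates("Player", keep="last")
print(by_tail)
print(by_dedup)

# Hypothetical reordering: James's oldest row (index 0) now comes last,
# so the "last row" for James is the Team A row rather than Team C.
reordered = df.iloc[[4, 1, 2, 3, 0]]
print(reordered.groupby("Player").tail(1))
```

If the row order cannot be trusted, a check that uses PreviousTeam is the safer choice.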