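In the example df below, what is the best way to keep:
- the first row in which a Score appears for each id
- then the first row each time the value of Score changes for that id, dropping the repeated rows until the next change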
Example df
date id Score
0 2001-09-06 1 3
1 2001-09-07 1 3
2 2001-09-08 1 4
3 2001-09-09 2 6
4 2001-09-10 2 6
5 2001-09-11 1 4
6 2001-09-12 2 5
7 2001-09-13 2 5
8 2001-09-14 1 3
Desired df
date id Score
0 2001-09-06 1 3
1 2001-09-08 1 4
2 2001-09-09 2 6
3 2001-09-12 2 5
4 2001-09-14 1 3
Use groupby together with diff:
print(df[df.groupby("id")["Score"].diff() != 0])
date id Score
0 2001-09-06 1 3
2 2001-09-08 1 4
3 2001-09-09 2 6
6 2001-09-12 2 5
8 2001-09-14 1 3
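The first occurrence in each group always produces NaN from diff, and NaN != 0 evaluates to True, so the first row of every id is kept.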
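For anyone who wants to reproduce this, here is a minimal self-contained sketch of the groupby/diff approach. The frame is rebuilt from the sample data in the question; the names `step`, `keep` and `result` are just illustrative, and the final `reset_index(drop=True)` is an assumption about wanting the 0..4 index shown in the desired df.

```python
import pandas as pd

# Rebuild the sample frame from the question.
df = pd.DataFrame({
    "date": pd.date_range("2001-09-06", periods=9, freq="D"),
    "id": [1, 1, 1, 2, 2, 1, 2, 2, 1],
    "Score": [3, 3, 4, 6, 6, 4, 5, 5, 3],
})

# Difference to the previous Score within the same id.
# The first row of each id has no predecessor, so it gets NaN.
step = df.groupby("id")["Score"].diff()

# NaN.ne(0) is True, so first rows are kept along with every change.
keep = step.ne(0)

# Optional: renumber the rows to match the desired df.
result = df[keep].reset_index(drop=True)
print(result)
```

Without the `reset_index` call the original row labels are preserved, as in the output shown above.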
Following your logic:
# shift Score within id
# shifted score at each group start is `NaN`
shifted_scores = df['Score'].groupby(df['id']).shift()
# change of Score within each id
# since first shifted score in each group is `NaN`
# mask is also True at first line of each group
mask = df['Score'].ne(shifted_scores)
# output
df[mask]
Output:
date id Score
0 2001-09-06 1 3
2 2001-09-08 1 4
3 2001-09-09 2 6
6 2001-09-12 2 5
8 2001-09-14 1 3
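One possible reason to prefer shift plus ne: it compares values instead of subtracting them, so the same pattern should carry over to non-numeric columns, whereas diff is meant for numeric data. The sketch below assumes a hypothetical string column `Grade`, which is not part of the original question.

```python
import pandas as pd

# Hypothetical data: "Grade" is an illustrative column, not from the question.
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 1],
    "Grade": ["B", "B", "A", "C", "C", "A"],
})

# Previous Grade within each id; NaN on the first row of each id.
previous = df.groupby("id")["Grade"].shift()

# True on the first row of each id and wherever the Grade changes.
changed = df["Grade"].ne(previous)

result = df[changed]
print(result)
```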
df.groupby(['id', 'Score']).first()
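A quick check of this groupby/first suggestion on the sample data, using the column name `Score` as it is spelled in the frame. It keeps one row per distinct (id, Score) pair, so a score that comes back later for the same id is not reported again. In this sketch (the name `collapsed` is illustrative), id 1 returning to Score 3 on 2001-09-14 is merged into the 2001-09-06 row, which gives four rows instead of the five in the desired df.

```python
import pandas as pd

# Rebuild the sample frame from the question.
df = pd.DataFrame({
    "date": pd.date_range("2001-09-06", periods=9, freq="D"),
    "id": [1, 1, 1, 2, 2, 1, 2, 2, 1],
    "Score": [3, 3, 4, 6, 6, 4, 5, 5, 3],
})

# One row per distinct (id, Score) pair, holding the earliest date.
collapsed = df.groupby(["id", "Score"]).first()
print(collapsed)

# The later return of id 1 to Score 3 is missing from the result.
print(len(collapsed))
```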