Pandas: identify duplicate records, create a new column, and add the ID of the first occurrence



I'm new to Python, so please bear with me :(

Say I have a DataFrame like this:

ID       B        C       D        E        isDuplicated
1       Blue     Green   Blue     Pink           false
2       Red      Green   Red      Green          false
3       Red      Orange  Yellow   Green          false
4       Blue     Pink    Blue     Pink           false
5       Blue     Orange  Pink     Green          false
6       Blue     Orange  Pink     Green          true
7       Red      Orange  Yellow   Green          true
8       Red      Orange  Yellow   Green          true

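If rows are duplicated on the subset B, C, D, E, I want to add another column, "firstOccurred", that holds the ID of the first occurrence. The DataFrame I want should look like this: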

ID       B        C       D        E        isDuplicated        firstOccurred
1       Blue     Green   Blue     Pink           false                         
2       Red      Green   Red      Green          false
3       Red      Orange  Yellow   Green          false
4       Blue     Pink    Blue     Pink           false
5       Blue     Orange  Pink     Green          false
6       Blue     Orange  Pink     Green          true               5
7       Red      Orange  Yellow   Green          true               3
8       Red      Orange  Yellow   Green          true               3

Any help would be much appreciated. Thanks in advance!

Use GroupBy.transform with 'first', passed to numpy.where so that the value is kept only for rows where isDuplicated is True:

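import numpy as np

# For duplicate rows, take the first ID within each (B, C, D, E) group; otherwise NaN
df['firstOccurred'] = np.where(df['isDuplicated'],
                               df.groupby(['B','C','D','E'])['ID'].transform('first'),
                               np.nan)
print(df)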
   ID     B       C       D      E  isDuplicated  firstOccurred
0   1  Blue   Green    Blue   Pink         False            NaN
1   2   Red   Green     Red  Green         False            NaN
2   3   Red  Orange  Yellow  Green         False            NaN
3   4  Blue    Pink    Blue   Pink         False            NaN
4   5  Blue  Orange    Pink  Green         False            NaN
5   6  Blue  Orange    Pink  Green          True            5.0
6   7   Red  Orange  Yellow  Green          True            3.0
7   8   Red  Orange  Yellow  Green          True            3.0
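
If isDuplicated is not already in the DataFrame, or it holds the strings "true"/"false" rather than booleans, it can probably be rebuilt with DataFrame.duplicated on the same subset. The following is only a sketch under those assumptions: the sample data is retyped from the question, the names cols and first_id are mine, and the cast to the nullable Int64 dtype is an optional extra to avoid the 5.0 / 3.0 floats shown above.

```python
import pandas as pd

# Sample data retyped from the question (without the isDuplicated column).
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8],
    "B": ["Blue", "Red", "Red", "Blue", "Blue", "Blue", "Red", "Red"],
    "C": ["Green", "Green", "Orange", "Pink", "Orange", "Orange", "Orange", "Orange"],
    "D": ["Blue", "Red", "Yellow", "Blue", "Pink", "Pink", "Yellow", "Yellow"],
    "E": ["Pink", "Green", "Green", "Pink", "Green", "Green", "Green", "Green"],
})

cols = ["B", "C", "D", "E"]  # columns that define a duplicate

# True for every row that repeats an earlier row on the subset columns.
df["isDuplicated"] = df.duplicated(subset=cols, keep="first")

# ID of the first row in each group, broadcast to every row of that group.
first_id = df.groupby(cols)["ID"].transform("first")

# Keep the first ID only on duplicate rows; the nullable Int64 dtype
# stores integers alongside missing values instead of casting to float.
df["firstOccurred"] = first_id.where(df["isDuplicated"]).astype("Int64")

print(df)
```

Series.where keeps the values where the mask is True and leaves the rest missing, so it plays the same role as np.where(..., np.nan) in the answer above.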
