Drop duplicates based on the combination of two columns in Pandas



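I need to drop rows that are duplicates based on the combination of strings in two columns (person1 and person2), regardless of order. For example, person1: ryan with person2: delta and person1: delta with person2: ryan are the same pair and carry the same value in the messages column. One of those two rows should be dropped, and all non-duplicate rows should be returned.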

Code to recreate df:
import pandas as pd
df = pd.DataFrame({"": [0,1,2,3,4,5,6],
"person1": ["ryan", "delta", "delta", "delta","bravo","alpha","ryan"], 
"person2": ["delta", "ryan", "alpha", "bravo","delta","ryan","alpha"], 
"messages": [1, 1, 2, 3,3,9,9]})
df
person1 person2 messages
0   0   ryan    delta   1
1   1   delta   ryan    1
2   2   delta   alpha   2
3   3   delta   bravo   3
4   4   bravo   delta   3
5   5   alpha   ryan    9
6   6   ryan    alpha   9

The resulting df should be:

finaldf
person1 person2 messages
0   0   ryan    delta   1
1   2   delta   alpha   2
2   3   delta   bravo   3
3   5   alpha   ryan    9

Try the following:

res = (df[~df.filter(like='person').apply(frozenset, axis=1).duplicated()]
.reset_index(drop=True))
print(res)
person1 person2  messages
0  0    ryan   delta         1
1  2   delta   alpha         2
2  3   delta   bravo         3
3  5   alpha    ryan         9

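  • First, we use df.filter to select the person* columns.
  • We then use df.apply to convert each row (axis=1) of those columns into a frozenset. Because a frozenset ignores order, (ryan, delta) and (delta, ryan) become the same key. At this stage the resulting pd.Series looks like this: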
0     (ryan, delta)
1     (ryan, delta)
2    (alpha, delta)
3    (bravo, delta)
4    (bravo, delta)
5     (alpha, ryan)
6     (alpha, ryan)
dtype: object
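  • Next, Series.duplicated flags every row whose key has already appeared. Prefixing the resulting boolean Series with ~ inverts it, so that only the non-duplicate rows are selected from the original df. Finally, we reset the index with df.reset_index.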
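
The steps above can also be written out one at a time, which makes the intermediate objects easier to inspect. This is only a sketch of the same one-liner on the question's sample data; the variable names people, keys and is_dup are chosen here for illustration and do not appear in the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "": [0, 1, 2, 3, 4, 5, 6],
    "person1": ["ryan", "delta", "delta", "delta", "bravo", "alpha", "ryan"],
    "person2": ["delta", "ryan", "alpha", "bravo", "delta", "ryan", "alpha"],
    "messages": [1, 1, 2, 3, 3, 9, 9],
})

# Step 1: keep only the columns whose name contains "person".
people = df.filter(like="person")

# Step 2: build one order-insensitive key per row.
keys = people.apply(frozenset, axis=1)

# Step 3: True for every row whose key already appeared in an earlier row.
is_dup = keys.duplicated()

# Step 4: invert the mask, select from the original df, renumber the index.
res = df[~is_dup].reset_index(drop=True)

print(is_dup.tolist())
print(res)
```

Series.duplicated keeps the first occurrence by default, which is why rows 0, 2, 3 and 5 of the original frame survive.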

Compared with the approach given by @ouroboros1, this one is less general: it only works when there are exactly two columns.

#make a Series of strings of min of p1/p2 concat to max of p1/p2  
sorted_p1p2 = df[['person1','person2']].min(axis=1)+'_'+df[['person1','person2']].max(axis=1)
#subset to non-dup from the Series
dedup_df = df[~sorted_p1p2.duplicated()]
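
One thing to watch with a string key joined by '_' is that names which themselves contain '_' could produce the same key for different pairs. A possible alternative, sketched below, sorts the values within each row using numpy.sort and checks the sorted rows for duplicates. Note that this is a different technique from the min/max concatenation above, not part of that answer; the names cols, sorted_vals and is_dup are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "": [0, 1, 2, 3, 4, 5, 6],
    "person1": ["ryan", "delta", "delta", "delta", "bravo", "alpha", "ryan"],
    "person2": ["delta", "ryan", "alpha", "bravo", "delta", "ryan", "alpha"],
    "messages": [1, 1, 2, 3, 3, 9, 9],
})

cols = ["person1", "person2"]  # more columns could be listed here

# Sort the values inside each row so that their order no longer matters.
sorted_vals = np.sort(df[cols].to_numpy(), axis=1)

# Flag rows whose sorted combination already appeared in an earlier row.
is_dup = pd.DataFrame(sorted_vals, index=df.index).duplicated()

# Select from the untouched original, so the kept rows keep their values.
dedup_df = df[~is_dup]
print(dedup_df)
```

Because no separator is involved, there is nothing to collide, and the same code should extend to more than two columns by adding names to cols.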

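You can put the two person columns into a consistent order within each row and then drop the duplicates. Note that this overwrites the values in the person columns, so a kept row may show the names in swapped order (row 5 ends up as ryan/alpha instead of alpha/ryan).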

import pandas as pd
df = pd.DataFrame({"": [0,1,2,3,4,5,6],
"person1": ["ryan", "delta", "delta", "delta","bravo","alpha","ryan"],
"person2": ["delta", "ryan", "alpha", "bravo","delta","ryan","alpha"],
"messages": [1, 1, 2, 3,3,9,9]})
print(df)
swap = df['person1'] < df['person2']
df.loc[swap, ['person1', 'person2']] = df.loc[swap, ['person2', 'person1']].values
df = df.drop_duplicates(subset=['person1', 'person2'])
print(df)

After the swap:

person1 person2  messages
0  0    ryan   delta         1
1  1    ryan   delta         1
2  2   delta   alpha         2
3  3   delta   bravo         3
4  4   delta   bravo         3
5  5    ryan   alpha         9
6  6    ryan   alpha         9

After dropping duplicates:

person1 person2  messages
0  0    ryan   delta         1
2  2   delta   alpha         2
3  3   delta   bravo         3
5  5    ryan   alpha         9
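
If the original person1/person2 values need to stay as they were, the same swap idea can be applied to a copy of the two columns, using the copy only to decide which rows to keep. This is a minimal sketch on the question's sample data; the names keys and result are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "": [0, 1, 2, 3, 4, 5, 6],
    "person1": ["ryan", "delta", "delta", "delta", "bravo", "alpha", "ryan"],
    "person2": ["delta", "ryan", "alpha", "bravo", "delta", "ryan", "alpha"],
    "messages": [1, 1, 2, 3, 3, 9, 9],
})

# Work on a copy of the two key columns so that df itself is never modified.
keys = df[["person1", "person2"]].copy()

# Same swap as above: afterwards person1 >= person2 in every row of keys.
swap = keys["person1"] < keys["person2"]
keys.loc[swap, ["person1", "person2"]] = keys.loc[swap, ["person2", "person1"]].values

# Keep the rows of the original df whose normalized key is seen for the first time.
result = df[~keys.duplicated()]
print(result)
```

Here row 5 is returned as alpha/ryan, matching the finaldf in the question; add .reset_index(drop=True) if a fresh 0..n-1 index is wanted.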
