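I need to remove duplicate rows based on the combination of strings in two columns (person1 and person2). For example, person1: ryan with person2: delta and person1: delta with person2: ryan are the same pair and carry the same value in the messages column, so one of those two rows should be dropped. Rows that have no duplicate should be returned as well.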
Code to recreate df
import pandas as pd
df = pd.DataFrame({"": [0,1,2,3,4,5,6],
"person1": ["ryan", "delta", "delta", "delta","bravo","alpha","ryan"],
"person2": ["delta", "ryan", "alpha", "bravo","delta","ryan","alpha"],
"messages": [1, 1, 2, 3,3,9,9]})
df
person1 person2 messages
0 0 ryan delta 1
1 1 delta ryan 1
2 2 delta alpha 2
3 3 delta bravo 3
4 4 bravo delta 3
5 5 alpha ryan 9
6 6 ryan alpha 9
The resulting df should be:
finaldf
person1 person2 messages
0 0 ryan delta 1
1 2 delta alpha 2
2 3 delta bravo 3
3 5 alpha ryan 9
Try the following:
res = (df[~df.filter(like='person').apply(frozenset, axis=1).duplicated()]
.reset_index(drop=True))
print(res)
person1 person2 messages
0 0 ryan delta 1
1 2 delta alpha 2
2 3 delta bravo 3
3 5 alpha ryan 9
- First, we use df.filter to select the person* columns.
- On those columns only, we use df.apply to turn each row (axis=1) into a frozenset. At this stage, the resulting pd.Series looks like this:
0 (ryan, delta)
1 (ryan, delta)
2 (alpha, delta)
3 (bravo, delta)
4 (bravo, delta)
5 (alpha, ryan)
6 (alpha, ryan)
dtype: object
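- Next, we flag the duplicated rows with Series.duplicated and prefix the resulting boolean Series with ~ to select the inverse from the original df. Finally, we reset the index with df.reset_index.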
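The steps above can be sketched one at a time, which makes the intermediate objects easier to inspect. The variable names (people, pair_keys, is_repeat) are only illustrative, and the unnamed "" column from the question is left out for brevity.

```python
import pandas as pd

df = pd.DataFrame({
    "person1": ["ryan", "delta", "delta", "delta", "bravo", "alpha", "ryan"],
    "person2": ["delta", "ryan", "alpha", "bravo", "delta", "ryan", "alpha"],
    "messages": [1, 1, 2, 3, 3, 9, 9],
})

# Step 1: keep only the columns whose name contains "person".
people = df.filter(like="person")

# Step 2: turn each row into an order-insensitive, hashable key.
pair_keys = people.apply(frozenset, axis=1)

# Step 3: flag every row whose key has already appeared in an earlier row.
is_repeat = pair_keys.duplicated()

# Step 4: keep the rows that are not repeats and renumber the index.
res = df[~is_repeat].reset_index(drop=True)
print(res)
```

Because a frozenset ignores order, ryan/delta and delta/ryan produce equal keys, so only the first row of each pair survives.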
Compared with the method given by @ouroboros1, this is a less general approach: it only works for the two-column case.
# Build a Series of key strings: the smaller of person1/person2, then '_', then the larger
pair = df[['person1', 'person2']]
sorted_p1p2 = pair.min(axis=1) + '_' + pair.max(axis=1)
# Keep only the rows whose key has not appeared before
dedup_df = df[~sorted_p1p2.duplicated()]
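The question also says the duplicated pairs share the same messages value. If that value should be part of the duplicate test, one possible variation is to keep the smaller and larger name in separate key columns and add messages next to them. This is only a sketch: the helper name drop_pair_duplicates and its parameters are made up for illustration, and the unnamed "" column is omitted.

```python
import pandas as pd

def drop_pair_duplicates(frame, a="person1", b="person2", extra=("messages",)):
    """Hypothetical helper: drop rows that repeat an unordered (a, b) pair."""
    # Smaller and larger name of each row, compared as plain strings.
    lo = frame[a].where(frame[a] <= frame[b], frame[b])
    hi = frame[a].where(frame[a] > frame[b], frame[b])
    keys = pd.DataFrame({"lo": lo, "hi": hi})
    # Optionally require the extra columns to match as well.
    for col in extra:
        keys[col] = frame[col]
    return frame[~keys.duplicated()]

df = pd.DataFrame({
    "person1": ["ryan", "delta", "delta", "delta", "bravo", "alpha", "ryan"],
    "person2": ["delta", "ryan", "alpha", "bravo", "delta", "ryan", "alpha"],
    "messages": [1, 1, 2, 3, 3, 9, 9],
})

dedup_df = drop_pair_duplicates(df)
print(dedup_df)
```

With the sample data the result is the same as before, because every repeated pair also repeats its messages value. Passing extra=() falls back to comparing the pair alone.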
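You can put the two person columns into a consistent order within each row and then drop the duplicates. Note that this overwrites person1 and person2 in the rows that get swapped.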
import pandas as pd
df = pd.DataFrame({"": [0,1,2,3,4,5,6],
"person1": ["ryan", "delta", "delta", "delta","bravo","alpha","ryan"],
"person2": ["delta", "ryan", "alpha", "bravo","delta","ryan","alpha"],
"messages": [1, 1, 2, 3,3,9,9]})
print(df)
swap = df['person1'] < df['person2']
df.loc[swap, ['person1', 'person2']] = df.loc[swap, ['person2', 'person1']].values
df = df.drop_duplicates(subset=['person1', 'person2'])
print(df)
After swapping:
person1 person2 messages
0 0 ryan delta 1
1 1 ryan delta 1
2 2 delta alpha 2
3 3 delta bravo 3
4 4 delta bravo 3
5 5 ryan alpha 9
6 6 ryan alpha 9
After dropping duplicates:
person1 person2 messages
0 0 ryan delta 1
2 2 delta alpha 2
3 3 delta bravo 3
5 5 ryan alpha 9
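If the original person1/person2 values need to stay untouched (row 5 above ends up as ryan/alpha instead of alpha/ryan), a possible alternative is to sort the names into a temporary frame and use it only to build the mask. This is a sketch that swaps in np.sort rather than the in-place swap shown above; the names cols and sorted_pairs are illustrative, and the unnamed "" column is omitted.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "person1": ["ryan", "delta", "delta", "delta", "bravo", "alpha", "ryan"],
    "person2": ["delta", "ryan", "alpha", "bravo", "delta", "ryan", "alpha"],
    "messages": [1, 1, 2, 3, 3, 9, 9],
})

cols = ["person1", "person2"]

# Sort the names within each row. np.sort returns a new array, so df is not modified.
sorted_pairs = pd.DataFrame(np.sort(df[cols].to_numpy(), axis=1), index=df.index)

# Use the sorted copy only to decide which rows to keep.
res = df[~sorted_pairs.duplicated()]
print(res)
```

The same pattern should carry over to more than two columns by extending cols, since the row-wise sort does not depend on the number of columns.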