How do I filter duplicates out of a DataFrame using groupby?

I have a DataFrame df. The combination (cfg, x, rounds) is unique; the other columns are not.

      cfg  x  rounds  score  rewards
0  f63c2c  a       1   0.01       10
1  f63c2c  a       2   0.02       15
2  f63c2c  b       3   0.03       30
3  f63c2c  b       4   0.04       13
4  f63c2c  b       5   0.05        8
5  37fb26  a       1   0.08        8
6  35442a  a       5   0.19        8
7  bb8460  b       2   0.05        9

I want to filter the DataFrame so that, for each (cfg, x) pair, only the row with max(rounds) remains, i.e.

      cfg  x  rounds  score  rewards
1  f63c2c  a       2   0.02       15
4  f63c2c  b       5   0.05        8
5  37fb26  a       1   0.08        8
6  35442a  a       5   0.19        8
7  bb8460  b       2   0.05        9

To do this, I determine the maximum values with:

gf = df.groupby(["cfg", "x"]).max().loc[:,["rounds"]]

However, I have not found a way to use gf as a predicate for filtering df. Any ideas?

You can indeed use df.groupby together with df.merge:

In [231]: (df.groupby(['cfg', 'x']).rounds
     ...:    .apply(np.max).reset_index()
     ...:    .merge(df, on=['cfg', 'x', 'rounds']))
Out[231]: 
      cfg  x  rounds  score  rewards
0  35442a  a       5   0.19        8
1  37fb26  a       1   0.08        8
2  bb8460  b       2   0.05        9
3  f63c2c  a       2   0.02       15
4  f63c2c  b       5   0.05        8
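For readers who want to run the groupby/merge approach outside an IPython session, here is a minimal, self-contained sketch. It rebuilds the example frame from the question and uses the built-in `.max()` aggregation instead of `.apply(np.max)`. The names `max_rounds` and `result` are only illustrative.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "cfg": ["f63c2c"] * 5 + ["37fb26", "35442a", "bb8460"],
    "x": ["a", "a", "b", "b", "b", "a", "a", "b"],
    "rounds": [1, 2, 3, 4, 5, 1, 5, 2],
    "score": [0.01, 0.02, 0.03, 0.04, 0.05, 0.08, 0.19, 0.05],
    "rewards": [10, 15, 30, 13, 8, 8, 8, 9],
})

# Per-group maximum of rounds, returned as a flat frame with
# the columns cfg, x and rounds (as_index=False avoids reset_index).
max_rounds = df.groupby(["cfg", "x"], as_index=False)["rounds"].max()

# The default inner merge keeps only the rows of df whose
# (cfg, x, rounds) combination is one of the group maxima.
result = max_rounds.merge(df, on=["cfg", "x", "rounds"])
print(result)
```

Because (cfg, x, rounds) is unique in df, the merge returns exactly one row per (cfg, x) group. The original index of df is not preserved by the merge.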

Alternatively, using df.sort_values:

In [237]: (df.sort_values(by=['cfg', 'x', 'rounds'], ascending=[True, True, False])
     ...:    .drop_duplicates(subset=['cfg', 'x']))
Out[237]: 
      cfg  x  rounds  score  rewards
6  35442a  a       5   0.19        8
5  37fb26  a       1   0.08        8
7  bb8460  b       2   0.05        9
1  f63c2c  a       2   0.02       15
4  f63c2c  b       5   0.05        8
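Closer to the question's idea of using the group maximum as a predicate, two other pandas idioms may be worth trying. Neither comes from the answers here, so treat this as a sketch: `transform("max")` broadcasts each group's maximum back onto the rows so it can be compared element-wise, and `idxmax()` returns the index label of the maximal row in each group. The names `mask`, `filtered` and `by_idxmax` are illustrative.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "cfg": ["f63c2c"] * 5 + ["37fb26", "35442a", "bb8460"],
    "x": ["a", "a", "b", "b", "b", "a", "a", "b"],
    "rounds": [1, 2, 3, 4, 5, 1, 5, 2],
    "score": [0.01, 0.02, 0.03, 0.04, 0.05, 0.08, 0.19, 0.05],
    "rewards": [10, 15, 30, 13, 8, 8, 8, 9],
})

# Boolean mask: True where a row's rounds equals its group's maximum.
# transform("max") returns a Series aligned with df's index.
mask = df["rounds"] == df.groupby(["cfg", "x"])["rounds"].transform("max")
filtered = df[mask]          # keeps the original row order and index

# idxmax gives one index label per group; .loc selects those rows.
by_idxmax = df.loc[df.groupby(["cfg", "x"])["rounds"].idxmax()]

print(filtered)
```

The mask variant would keep several rows per group if the maximum were tied, whereas `idxmax` keeps only one. With unique (cfg, x, rounds), as in the question, both select the same rows.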

Performance

df_test = pd.concat([df] * 100000) # Setup

Using df.sort_values and df.drop_duplicates:

%timeit df_test.sort_values(by=['cfg', 'x', 'rounds'], ascending=[True, True, False]).drop_duplicates(subset=['cfg', 'x'])
1 loop, best of 3: 229 ms per loop

Using df.groupby and df.merge:

%timeit df_test.groupby(['cfg', 'x']).rounds.apply(np.max).reset_index().merge(df, on=['cfg', 'x', 'rounds'])
10 loops, best of 3: 129 ms per loop

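The solution is not to use groupby (or, more precisely, the simplest solution does not use groupby) but drop_duplicates. By default, drop_duplicates keeps the first row of each set of duplicates, so you can sort the DataFrame by rounds in descending order and then drop the duplicates: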

gf = (df.sort_values(by='rounds', ascending=False)
        .drop_duplicates(subset=['cfg', 'x']))

      cfg  x  rounds  score  rewards
4  f63c2c  b       5   0.05        8
6  35442a  a       5   0.19        8
1  f63c2c  a       2   0.02       15
7  bb8460  b       2   0.05        9
5  37fb26  a       1   0.08        8

(Rows with equal rounds may appear in a different order.)

Equivalently, you can sort in ascending order and keep the last row of each group:

gf = (df.sort_values(by='rounds', ascending=True)
        .drop_duplicates(subset=['cfg', 'x'], keep='last'))
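As a quick sanity check that the descending/keep-first and ascending/keep-last variants select the same rows, here is a small sketch on the question's data. The names `keep_first` and `keep_last` are illustrative. The comparison sorts both results by index first, because the two variants return the rows in different orders.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "cfg": ["f63c2c"] * 5 + ["37fb26", "35442a", "bb8460"],
    "x": ["a", "a", "b", "b", "b", "a", "a", "b"],
    "rounds": [1, 2, 3, 4, 5, 1, 5, 2],
    "score": [0.01, 0.02, 0.03, 0.04, 0.05, 0.08, 0.19, 0.05],
    "rewards": [10, 15, 30, 13, 8, 8, 8, 9],
})

# Variant 1: largest rounds first, keep the first row per (cfg, x).
keep_first = (df.sort_values(by="rounds", ascending=False)
                .drop_duplicates(subset=["cfg", "x"]))

# Variant 2: smallest rounds first, keep the last row per (cfg, x).
keep_last = (df.sort_values(by="rounds", ascending=True)
               .drop_duplicates(subset=["cfg", "x"], keep="last"))

# Same rows, possibly in a different order, so compare after sort_index().
same_rows = keep_first.sort_index().equals(keep_last.sort_index())
print(same_rows)
```

The equivalence relies on rounds being unique within each (cfg, x) group, as stated in the question. With tied maxima the two variants could pick different rows.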

Edit: timings

Surprisingly, I do not get the same timings as coldspeed reports in his answer:

df_test = pd.concat([df] * 100000)
%timeit df_test.sort_values(by=['cfg', 'rounds'], ascending=True).drop_duplicates(subset=['cfg'], keep='last')
%timeit df_test.groupby('cfg').rounds.apply(np.max).reset_index().merge(my_df2, on=['cfg', 'rounds'])
62 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
70.6 ms ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

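This does not seem to depend on the number of cores (I ran it on 8 and on 12 cores and got the same ranking), nor on the size of the DataFrame (I tried df_test with 10 000, 100 000 and 1 000 000 rows, and the ranking stayed the same).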

So I guess it depends on your hardware; you will just have to try both methods and see which one is faster on your machine.

Thanks to coldspeed for pointing this out.
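To try both approaches on your own machine, something like the following sketch with the standard `timeit` module could be used outside IPython. One caveat: `pd.concat([df] * 100000)` repeats every row, so (cfg, x, rounds) is no longer unique and a merge against the replicated frame would return every copy. This sketch therefore uses synthetic data with unique keys, an assumption that differs from the benchmarks above. The names `big`, `n_groups`, `rounds_per_group`, `via_sort` and `via_groupby_merge` are hypothetical.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic frame with unique (cfg, x, rounds); the sizes are arbitrary.
n_groups = 20_000
rounds_per_group = 5
big = pd.DataFrame({
    "cfg": np.repeat(np.arange(n_groups), rounds_per_group),
    "x": "a",
    "rounds": np.tile(np.arange(1, rounds_per_group + 1), n_groups),
})
big["score"] = big["rounds"] * 0.01


def via_sort(frame):
    # Sort so the largest rounds comes first within each (cfg, x) group,
    # then keep the first row of every group.
    return (frame.sort_values(by=["cfg", "x", "rounds"],
                              ascending=[True, True, False])
                 .drop_duplicates(subset=["cfg", "x"]))


def via_groupby_merge(frame):
    # Compute per-group maxima, then merge back to recover the full rows.
    max_rounds = frame.groupby(["cfg", "x"], as_index=False)["rounds"].max()
    return max_rounds.merge(frame, on=["cfg", "x", "rounds"])


timings = {}
for fn in (via_sort, via_groupby_merge):
    # Best of three single runs, in seconds.
    timings[fn.__name__] = min(timeit.repeat(lambda: fn(big), number=1, repeat=3))
    print(f"{fn.__name__}: {timings[fn.__name__] * 1000:.1f} ms")
```

Which function comes out ahead may differ between machines and pandas versions, which is consistent with the differing results reported in the two answers.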
