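I have a MultiIndex and want to apply drop_duplicates on a per-level basis. I don't want to compare rows across the whole DataFrame; a row should only be dropped when it duplicates another row under the same main (first-level) index.
Example: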
entry  subentry    A    B
1      0         1.0  1.0
       1         1.0  1.0
       2         2.0  2.0
2      0         1.0  1.0
       1         2.0  2.0
       2         2.0  2.0
should return:
entry  subentry    A    B
1      0         1.0  1.0
       2         2.0  2.0
2      0         1.0  1.0
       1         2.0  2.0
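For anyone who wants to try this out, here is a minimal sketch that rebuilds the input frame. The construction itself is an assumption (the post only shows the printed table); the level names and values are taken from the example above.

```python
import pandas as pd

# Assumed construction: two entries, each with subentries 0, 1, 2.
index = pd.MultiIndex.from_product(
    [[1, 2], [0, 1, 2]], names=["entry", "subentry"]
)
values = [1.0, 1.0, 2.0, 1.0, 2.0, 2.0]
df = pd.DataFrame({"A": values, "B": values}, index=index)

print(df)
```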
Use MultiIndex.get_level_values with Index.duplicated and boolean indexing to filter out the last row of each entry:
df1 = df[df.index.get_level_values('entry').duplicated(keep='last')]
print (df1)
                  A    B
entry subentry
1     0         1.0  1.0
      1         1.0  1.0
2     0         1.0  1.0
      1         2.0  2.0
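To make the mask visible, here is a small sketch under the assumption that df is built as below (the constructor is not part of the original post). Note that this mask looks only at the entry labels, not at the values in A and B.

```python
import pandas as pd

# Assumed reconstruction of the example frame.
index = pd.MultiIndex.from_product(
    [[1, 2], [0, 1, 2]], names=["entry", "subentry"]
)
values = [1.0, 1.0, 2.0, 1.0, 2.0, 2.0]
df = pd.DataFrame({"A": values, "B": values}, index=index)

# Index.duplicated returns a NumPy boolean array; with keep="last" only the
# final occurrence of each entry label is marked as not duplicated.
mask = df.index.get_level_values("entry").duplicated(keep="last")
print(mask.tolist())  # [True, True, False, True, True, False]

# Selecting with the mask keeps every row except the last one per entry.
df1 = df[mask]
print(df1.index.tolist())  # [(1, 0), (1, 1), (2, 0), (2, 1)]
```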
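Or, if you need to remove rows that are duplicated within each first-level group (comparing the column values as well), move the first level into a column with DataFrame.reset_index. For the filter, invert the boolean mask with ~ and convert the Series to a NumPy array, because the index of the mask no longer matches the index of the original DataFrame (keep='last' retains the last occurrence of each duplicate, while the default keep='first' would give the output asked for in the question):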
df2 = df[~df.reset_index(level=0).duplicated(keep='last').to_numpy()]
print (df2)
                  A    B
entry subentry
1     1         1.0  1.0
      2         2.0  2.0
2     0         1.0  1.0
      2         2.0  2.0
Or create a helper column from the first level of the MultiIndex:
df2 = df[~df.assign(new=df.index.get_level_values('entry')).duplicated(keep='last')]
print (df2)
                  A    B
entry subentry
1     1         1.0  1.0
      2         2.0  2.0
2     0         1.0  1.0
      2         2.0  2.0
Details:
print (df.reset_index(level=0))
          entry    A    B
subentry
0             1  1.0  1.0
1             1  1.0  1.0
2             1  2.0  2.0
0             2  1.0  1.0
1             2  2.0  2.0
2             2  2.0  2.0
print (~df.reset_index(level=0).duplicated(keep='last'))
0    False
1     True
2     True
0     True
1    False
2     True
dtype: bool
print (df.assign(new=df.index.get_level_values('entry')))
                  A    B  new
entry subentry
1     0         1.0  1.0    1
      1         1.0  1.0    1
      2         2.0  2.0    1
2     0         1.0  1.0    2
      1         2.0  2.0    2
      2         2.0  2.0    2
print (~df.assign(new=df.index.get_level_values('entry')).duplicated(keep='last'))
entry  subentry
1      0           False
       1            True
       2            True
2      0            True
       1           False
       2            True
dtype: bool
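The details above also show why the reset_index variant goes through to_numpy(): that mask is indexed by subentry alone, while the mask from the helper-column variant keeps the full MultiIndex. A minimal sketch of the difference, assuming df is rebuilt as below (the constructor is not from the original post):

```python
import pandas as pd

# Assumed reconstruction of the example frame.
index = pd.MultiIndex.from_product(
    [[1, 2], [0, 1, 2]], names=["entry", "subentry"]
)
values = [1.0, 1.0, 2.0, 1.0, 2.0, 2.0]
df = pd.DataFrame({"A": values, "B": values}, index=index)

# After reset_index(level=0) the mask is indexed by subentry only.
mask = ~df.reset_index(level=0).duplicated(keep="last")
print(mask.index.equals(df.index))  # False

# A plain NumPy array is applied by position, so no index alignment happens.
df2 = df[mask.to_numpy()]
print(df2.index.tolist())  # [(1, 1), (1, 2), (2, 0), (2, 2)]

# The helper-column mask shares df's index and can be used directly.
mask_helper = ~df.assign(new=df.index.get_level_values("entry")).duplicated(keep="last")
print(mask_helper.index.equals(df.index))  # True
```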
It looks like you want to apply drop_duplicates per group:
out = df.groupby(level=0, group_keys=False).apply(lambda d: d.drop_duplicates())
Or, a possibly more efficient variant that uses a temporary reset_index together with duplicated and boolean indexing:
out = df[~df.reset_index('entry').duplicated().values]
Output:
                  A    B
entry subentry
1     0         1.0  1.0
      2         2.0  2.0
2     0         1.0  1.0
      1         2.0  2.0
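As a quick sanity check, the two variants can be run side by side. This is only a sketch: the frame construction is an assumption, and the names by_group and by_mask are made up for the comparison.

```python
import pandas as pd

# Assumed reconstruction of the example frame.
index = pd.MultiIndex.from_product(
    [[1, 2], [0, 1, 2]], names=["entry", "subentry"]
)
values = [1.0, 1.0, 2.0, 1.0, 2.0, 2.0]
df = pd.DataFrame({"A": values, "B": values}, index=index)

# Variant 1: drop duplicates inside each first-level group.
by_group = df.groupby(level=0, group_keys=False).apply(
    lambda d: d.drop_duplicates()
)

# Variant 2: turn 'entry' into a column so it takes part in the comparison,
# then filter the original frame with the positional mask.
by_mask = df[~df.reset_index("entry").duplicated().values]

print(by_group.index.tolist())
print(by_mask.index.tolist())
```

Both variants use the default keep='first', so the first occurrence within each entry is the one that survives.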