我有一个多索引的df。IDX1和IDX2都有重复的值(分别为'C'和'Y'),我想保留两者的第一个实例并保留索引值;df_desired是我想要的结果。
我已经尝试过df_tst = df.groupby(level=[0,2]).first()
,但这会从df中删除IDX2,因此我无法从IDX1中删除副本。
# starting data
import pandas as pd
data =
[['31/01/2021','A','X',10,15],
['28/02/2021','A','X',20,30],
['31/03/2021','A','X',30,45],
['31/01/2021','B','Y',20,15],
['28/02/2021','B','Y',20,15],
['31/03/2021','B','Y',30,30],
['31/01/2021','C','Z',40,45],
['28/02/2021','C','Z',50,55],
['31/03/2021','C','Z',60,65],
['31/01/2021','C','Q',40,45],
['28/02/2021','C','Q',50,55],
['31/03/2021','C','Q',60,65],
['31/01/2021','D','Y',20,15],
['28/02/2021','D','Y',20,15],
['31/03/2021','D','Y',30,30]]
df=pd.DataFrame(data)
df.columns = ['DATE','IDX1','IDX2','VAR1','VAR2']
df['DATE'] = pd.to_datetime(df['DATE'])
df.set_index(['IDX1','IDX2','DATE'], inplace=True)
# target df:
df_desired =
[['31/01/2021','A','X',10,15],
['28/02/2021','A','X',20,30],
['31/03/2021','A','X',30,45],
['31/01/2021','B','Y',20,15],
['28/02/2021','B','Y',20,15],
['31/03/2021','B','Y',30,30],
['31/01/2021','C','Z',40,45],
['28/02/2021','C','Z',50,55],
['31/03/2021','C','Z',60,65]]
使用说明:
#get first indices per first and second level with remove last level of MultiIndex
df1 = df.droplevel(-1)
idx1 = df1.groupby(level=0).head(1).index
idx2 = df1.groupby(level=1).head(1).index
#compare with intersecton of both levels
df = df[df1.index.isin(idx1.intersection(idx2))]
print (df)
VAR1 VAR2
IDX1 IDX2 DATE
A X 2021-01-31 10 15
2021-02-28 20 30
2021-03-31 30 45
B Y 2021-01-31 20 15
2021-02-28 20 15
2021-03-31 30 30
C Z 2021-01-31 40 45
2021-02-28 50 55
2021-03-31 60 65
或:
#test duplicated values by first and second level with helper df1 DataFrame
df1 = df.index.to_frame()
df2 = df.droplevel(-1)
idx = df2.index[~df1.duplicated(['IDX1']) & ~df1.duplicated(['IDX2'])]
df = df[df2.index.isin(idx)]
print (df)
VAR1 VAR2
IDX1 IDX2 DATE
A X 2021-01-31 10 15
2021-02-28 20 30
2021-03-31 30 45
B Y 2021-01-31 20 15
2021-02-28 20 15
2021-03-31 30 30
C Z 2021-01-31 40 45
2021-02-28 50 55
2021-03-31 60 65