I have the following dataframe:
df1 = pd.DataFrame({"id": ['A1', 'A2', 'A3', 'A4', 'B1', 'B2', 'B3', 'B4', 'C1','C2','C3','C4' ],
"date": [pd.Timestamp(2015, 12, 30), pd.Timestamp(2016, 12, 30), pd.Timestamp(2017, 12, 30), pd.Timestamp(2018, 12, 30),pd.Timestamp(2015, 12, 30), pd.Timestamp(2016, 12, 30), pd.Timestamp(2017, 12, 30), pd.Timestamp(2018, 12, 30), pd.Timestamp(2016, 12, 30), pd.Timestamp(2017, 12, 30), pd.Timestamp(2018, 12, 30), pd.Timestamp(2019, 12, 30)],
"other_col": ['NA', 'NA', 'A333', 'A444', 'NA', 'NA', 'B555', 'B666', 'NA', 'C777', 'C888', 'C999'],
"other_col_1": [123, 123, 'NA', 'NA', 0.765, 0.555, 'NA', 'NA', 0.324, 'NA', 'NA','NA']})
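I want to drop rows where an id group (A, B, C) has more than one value in "other_col", keeping only the most recent such row for each group. The resulting dataframe should be: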
df_new = pd.DataFrame({"id": ['A1', 'A2', 'A4', 'B1', 'B2', 'B4', 'C1','C4' ],
"date": [pd.Timestamp(2015, 12, 30), pd.Timestamp(2016, 12, 30), pd.Timestamp(2018, 12, 30),pd.Timestamp(2015, 12, 30), pd.Timestamp(2016, 12, 30), pd.Timestamp(2018, 12, 30), pd.Timestamp(2016, 12, 30), pd.Timestamp(2019, 12, 30)],
"other_col": ['NA', 'NA', 'A444', 'NA', 'NA', 'B666', 'NA', 'C999'],
"other_col_1": [123, 123, 'NA', 0.765, 0.555, 'NA', 0.324, 'NA']})
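First convert the NA values in other_col to missing values and, if necessary, sort by id and date. Then group by id with the digits removed and use GroupBy.last (through transform) to get the last non-missing other_col value per group. Finally, keep the rows that match that value, plus the rows where other_col is missing: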
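import numpy as np
df1['other_col'] = df1['other_col'].replace('NA', np.nan)
df1 = df1.sort_values(['id','date'])
# strip the digits from id so that A1..A4 share the group key 'A'
s = df1.groupby(df1['id'].str.replace(r'\d+', '', regex=True))['other_col'].transform('last')
# keep rows holding the last non-missing value, plus rows where other_col is missing
df_new = df1[df1['other_col'].eq(s) | df1['other_col'].isna()]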
print (df_new)
id date other_col other_col_1
0 A1 2015-12-30 NaN 123
1 A2 2016-12-30 NaN 123
3 A4 2018-12-30 A444 NA
4 B1 2015-12-30 NaN 0.765
5 B2 2016-12-30 NaN 0.555
7 B4 2018-12-30 B666 NA
8 C1 2016-12-30 NaN 0.324
11 C4 2019-12-30 C999 NA
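The filter relies on the fact that the last aggregation skips missing values, so every row in a group is compared against that group's last non-missing other_col. A minimal self-contained sketch on a shortened version of the data (the names df, group_key, last_value and kept are illustrative, not from the answer above):

```python
import numpy as np
import pandas as pd

# Shortened, already sorted sample: groups A and C only.
df = pd.DataFrame({
    "id": ["A1", "A2", "A3", "A4", "C1", "C2"],
    "other_col": ["NA", "NA", "A333", "A444", "NA", "C777"],
})

# Turn the 'NA' strings into real missing values.
df["other_col"] = df["other_col"].replace("NA", np.nan)

# Group key: the id with its digits removed ('A1' -> 'A').
group_key = df["id"].str.replace(r"\d+", "", regex=True)

# Broadcast each group's last non-missing other_col to all of its rows.
last_value = df.groupby(group_key)["other_col"].transform("last")

# Keep rows equal to that last value, plus rows where other_col is missing.
kept = df[df["other_col"].eq(last_value) | df["other_col"].isna()]
print(kept)
```

Here A3 is dropped because A444 is the last non-missing value in group A, while the rows with a missing other_col are all kept.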
IIUC, you can groupby the letter and the NA status, and take the last row of each group:
df2 = df1.groupby([df1['id'].str[0], df1['other_col'].eq('NA')],
sort=False, as_index=False).last()
Output:
id date other_col
0 A1 2016-12-30 NA
1 A3 2018-12-30 444
2 B1 2016-12-30 NA
3 B3 2018-12-30 222
4 C1 2016-12-30 NA
5 C4 2019-12-30 888
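A more generic way to get the id prefix (the leading non-digit characters): df1['id'].str.extract(r'^(\D+)', expand=False)
If you have real NaN values in other_col, use df1['other_col'].isna() instead of df1['other_col'].eq('NA')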
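As a quick illustration of the regex-based prefix, here is a small sketch on a made-up Series (the values in ids are hypothetical and include a two-letter prefix that str[0] would truncate):

```python
import pandas as pd

# Hypothetical ids; 'AB12' has a prefix longer than one character.
ids = pd.Series(["A1", "AB12", "C3"])

# Capture the leading run of non-digit characters as the group key.
prefix = ids.str.extract(r"^(\D+)", expand=False)

# For comparison: the first-character approach.
first_char = ids.str[0]

print(prefix.tolist())
print(first_char.tolist())
```

With expand=False and a single capture group, str.extract returns a Series, so it can be passed straight to groupby as a key.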