是否有一种方法可以在"组"中只保留最新的观察结果?
例如,我只想保留每个PrimaryID-SecondaryID对的最新观察值。
PrimaryID SecondaryID SubAccount Value ReportDate
0 1 A 123 5618.48 2022-01-01
1 1 A 456 8206.23 2022-01-01
2 1 A 123 6722.05 2022-07-01
3 1 A 456 5500.53 2022-07-01
4 1 B 789 8990.75 2022-02-01
5 1 B 987 6294.63 2022-02-01
6 1 B 789 8389.60 2022-03-01
7 1 B 246 343.02 2022-03-01
8 2 X 234 4157.57 2022-02-01
9 2 X 752 8218.00 2022-02-01
10 2 X 234 6430.68 2022-03-01
11 2 X 755 7148.57 2022-03-01
12 2 Y 731 5406.63 2022-05-02
13 2 Y 480 2429.83 2022-05-02
14 2 Y 731 6251.38 2022-06-01
15 2 Y 841 8256.93 2022-06-01
这是完成此任务的一种方法,但看起来很草率。
df['lastRptDt'] = df.groupby(['PrimaryID', 'SecondaryID'])['ReportDate'].transform(max)
df1 = df[(df['ReportDate']==df['lastRptDt'])]
这是期望的输出:
PrimaryID SecondaryID SubAccount Value ReportDate lastRptDt
2 1 A 123 6722.05 2022-07-01 2022-07-01
3 1 A 456 5500.53 2022-07-01 2022-07-01
6 1 B 789 8389.60 2022-03-01 2022-03-01
7 1 B 246 343.02 2022-03-01 2022-03-01
10 2 X 234 6430.68 2022-03-01 2022-03-01
11 2 X 755 7148.57 2022-03-01 2022-03-01
14 2 Y 731 6251.38 2022-06-01 2022-06-01
15 2 Y 841 8256.93 2022-06-01 2022-06-01
编辑/更新:在这篇相关文章(改进一行python脚本的时间)的评论中,@rpanai要求从df.to_dict()中获得原始和期望的输出。
原始数据
>>> df.to_dict()
{'PrimaryID': {0: 1,
1: 1,
2: 1,
3: 1,
4: 1,
5: 1,
6: 1,
7: 1,
8: 2,
9: 2,
10: 2,
11: 2,
12: 2,
13: 2,
14: 2,
15: 2},
'SecondaryID': {0: 'A',
1: 'A',
2: 'A',
3: 'A',
4: 'B',
5: 'B',
6: 'B',
7: 'B',
8: 'X',
9: 'X',
10: 'X',
11: 'X',
12: 'Y',
13: 'Y',
14: 'Y',
15: 'Y'},
'SubAccount': {0: 123,
1: 456,
2: 123,
3: 456,
4: 789,
5: 987,
6: 789,
7: 246,
8: 234,
9: 752,
10: 234,
11: 755,
12: 731,
13: 480,
14: 731,
15: 841},
'Value': {0: 5618.48,
1: 8206.23,
2: 6722.05,
3: 5500.53,
4: 8990.75,
5: 6294.63,
6: 8389.6,
7: 343.02,
8: 4157.57,
9: 8218.0,
10: 6430.68,
11: 7148.57,
12: 5406.63,
13: 2429.83,
14: 6251.38,
15: 8256.93},
'ReportDate': {0: Timestamp('2022-01-01 00:00:00'),
1: Timestamp('2022-01-01 00:00:00'),
2: Timestamp('2022-07-01 00:00:00'),
3: Timestamp('2022-07-01 00:00:00'),
4: Timestamp('2022-02-01 00:00:00'),
5: Timestamp('2022-02-01 00:00:00'),
6: Timestamp('2022-03-01 00:00:00'),
7: Timestamp('2022-03-01 00:00:00'),
8: Timestamp('2022-02-01 00:00:00'),
9: Timestamp('2022-02-01 00:00:00'),
10: Timestamp('2022-03-01 00:00:00'),
11: Timestamp('2022-03-01 00:00:00'),
12: Timestamp('2022-05-02 00:00:00'),
13: Timestamp('2022-05-02 00:00:00'),
14: Timestamp('2022-06-01 00:00:00'),
15: Timestamp('2022-06-01 00:00:00')}}
所需数据集
>>> df1.to_dict()
{'PrimaryID': {2: 1, 3: 1, 6: 1, 7: 1, 10: 2, 11: 2, 14: 2, 15: 2},
'SecondaryID': {2: 'A',
3: 'A',
6: 'B',
7: 'B',
10: 'X',
11: 'X',
14: 'Y',
15: 'Y'},
'SubAccount': {2: 123,
3: 456,
6: 789,
7: 246,
10: 234,
11: 755,
14: 731,
15: 841},
'Value': {2: 6722.05,
3: 5500.53,
6: 8389.6,
7: 343.02,
10: 6430.68,
11: 7148.57,
14: 6251.38,
15: 8256.93},
'ReportDate': {2: Timestamp('2022-07-01 00:00:00'),
3: Timestamp('2022-07-01 00:00:00'),
6: Timestamp('2022-03-01 00:00:00'),
7: Timestamp('2022-03-01 00:00:00'),
10: Timestamp('2022-03-01 00:00:00'),
11: Timestamp('2022-03-01 00:00:00'),
14: Timestamp('2022-06-01 00:00:00'),
15: Timestamp('2022-06-01 00:00:00')},
'lastRptDt': {2: Timestamp('2022-07-01 00:00:00'),
3: Timestamp('2022-07-01 00:00:00'),
6: Timestamp('2022-03-01 00:00:00'),
7: Timestamp('2022-03-01 00:00:00'),
10: Timestamp('2022-03-01 00:00:00'),
11: Timestamp('2022-03-01 00:00:00'),
14: Timestamp('2022-06-01 00:00:00'),
15: Timestamp('2022-06-01 00:00:00')}}
这个怎么样?
df.set_index(['PrimaryID', 'SecondaryID', 'ReportDate']).loc[:,:,df.groupby(['PrimaryID', 'SecondaryID']).ReportDate.max()]
Out[54]:
SubAccount Value lastRptDt
PrimaryID SecondaryID ReportDate
1 A 2022-07-01 123 6722.05 2022-07-01
2022-07-01 456 5500.53 2022-07-01
B 2022-03-01 789 8389.60 2022-03-01
2022-03-01 246 343.02 2022-03-01
2 X 2022-03-01 234 6430.68 2022-03-01
2022-03-01 755 7148.57 2022-03-01
Y 2022-06-01 731 6251.38 2022-06-01
2022-06-01 841 8256.93 2022-06-01
要保留索引,可以使用:
df.reset_index()
.set_index(['PrimaryID', 'SecondaryID', 'ReportDate'], drop=False)
.loc[:,:,df.groupby(['PrimaryID', 'SecondaryID']).ReportDate.max()]
.set_index('index')
Out[49]:
PrimaryID SecondaryID ReportDate SubAccount Value
index
2 1 A 2022-07-01 123 6722.05
3 1 A 2022-07-01 456 5500.53
6 1 B 2022-03-01 789 8389.60
7 1 B 2022-03-01 246 343.02
10 2 X 2022-03-01 234 6430.68
11 2 X 2022-03-01 755 7148.57
14 2 Y 2022-06-01 731 6251.38
15 2 Y 2022-06-01 841 8256.93
reset_index()
将附加原始索引作为列,您可以在完成后将其重置为索引。注意set_index(.., drop=False)
中需要首先使用第二列展示。