保留同一组内具有相同(最近)日期的最后一组obs



是否有一种方法可以在"组"中只保留最新的观察结果?

例如,我只想保留每个PrimaryID-SecondaryID对的最新观察值。

PrimaryID   SecondaryID     SubAccount  Value   ReportDate
0   1   A   123     5618.48     2022-01-01
1   1   A   456     8206.23     2022-01-01
2   1   A   123     6722.05     2022-07-01
3   1   A   456     5500.53     2022-07-01
4   1   B   789     8990.75     2022-02-01
5   1   B   987     6294.63     2022-02-01
6   1   B   789     8389.60     2022-03-01
7   1   B   246     343.02  2022-03-01
8   2   X   234     4157.57     2022-02-01
9   2   X   752     8218.00     2022-02-01
10  2   X   234     6430.68     2022-03-01
11  2   X   755     7148.57     2022-03-01
12  2   Y   731     5406.63     2022-05-02
13  2   Y   480     2429.83     2022-05-02
14  2   Y   731     6251.38     2022-06-01
15  2   Y   841     8256.93     2022-06-01

这是完成此任务的一种方法,但看起来很草率。

df['lastRptDt'] = df.groupby(['PrimaryID', 'SecondaryID'])['ReportDate'].transform(max)
df1 = df[(df['ReportDate']==df['lastRptDt'])]

这是期望的输出:

PrimaryID   SecondaryID     SubAccount  Value   ReportDate  lastRptDt
2   1   A   123     6722.05     2022-07-01  2022-07-01
3   1   A   456     5500.53     2022-07-01  2022-07-01
6   1   B   789     8389.60     2022-03-01  2022-03-01
7   1   B   246     343.02  2022-03-01  2022-03-01
10  2   X   234     6430.68     2022-03-01  2022-03-01
11  2   X   755     7148.57     2022-03-01  2022-03-01
14  2   Y   731     6251.38     2022-06-01  2022-06-01
15  2   Y   841     8256.93     2022-06-01  2022-06-01

编辑/更新:在这篇相关文章(改进一行python脚本的时间)的评论中,@rpanai要求从df.to_dict()中获得原始和期望的输出。

原始数据

>>> df.to_dict()
{'PrimaryID': {0: 1,
1: 1,
2: 1,
3: 1,
4: 1,
5: 1,
6: 1,
7: 1,
8: 2,
9: 2,
10: 2,
11: 2,
12: 2,
13: 2,
14: 2,
15: 2},
'SecondaryID': {0: 'A',
1: 'A',
2: 'A',
3: 'A',
4: 'B',
5: 'B',
6: 'B',
7: 'B',
8: 'X',
9: 'X',
10: 'X',
11: 'X',
12: 'Y',
13: 'Y',
14: 'Y',
15: 'Y'},
'SubAccount': {0: 123,
1: 456,
2: 123,
3: 456,
4: 789,
5: 987,
6: 789,
7: 246,
8: 234,
9: 752,
10: 234,
11: 755,
12: 731,
13: 480,
14: 731,
15: 841},
'Value': {0: 5618.48,
1: 8206.23,
2: 6722.05,
3: 5500.53,
4: 8990.75,
5: 6294.63,
6: 8389.6,
7: 343.02,
8: 4157.57,
9: 8218.0,
10: 6430.68,
11: 7148.57,
12: 5406.63,
13: 2429.83,
14: 6251.38,
15: 8256.93},
'ReportDate': {0: Timestamp('2022-01-01 00:00:00'),
1: Timestamp('2022-01-01 00:00:00'),
2: Timestamp('2022-07-01 00:00:00'),
3: Timestamp('2022-07-01 00:00:00'),
4: Timestamp('2022-02-01 00:00:00'),
5: Timestamp('2022-02-01 00:00:00'),
6: Timestamp('2022-03-01 00:00:00'),
7: Timestamp('2022-03-01 00:00:00'),
8: Timestamp('2022-02-01 00:00:00'),
9: Timestamp('2022-02-01 00:00:00'),
10: Timestamp('2022-03-01 00:00:00'),
11: Timestamp('2022-03-01 00:00:00'),
12: Timestamp('2022-05-02 00:00:00'),
13: Timestamp('2022-05-02 00:00:00'),
14: Timestamp('2022-06-01 00:00:00'),
15: Timestamp('2022-06-01 00:00:00')}}

所需数据集

>>> df1.to_dict()
{'PrimaryID': {2: 1, 3: 1, 6: 1, 7: 1, 10: 2, 11: 2, 14: 2, 15: 2},
'SecondaryID': {2: 'A',
3: 'A',
6: 'B',
7: 'B',
10: 'X',
11: 'X',
14: 'Y',
15: 'Y'},
'SubAccount': {2: 123,
3: 456,
6: 789,
7: 246,
10: 234,
11: 755,
14: 731,
15: 841},
'Value': {2: 6722.05,
3: 5500.53,
6: 8389.6,
7: 343.02,
10: 6430.68,
11: 7148.57,
14: 6251.38,
15: 8256.93},
'ReportDate': {2: Timestamp('2022-07-01 00:00:00'),
3: Timestamp('2022-07-01 00:00:00'),
6: Timestamp('2022-03-01 00:00:00'),
7: Timestamp('2022-03-01 00:00:00'),
10: Timestamp('2022-03-01 00:00:00'),
11: Timestamp('2022-03-01 00:00:00'),
14: Timestamp('2022-06-01 00:00:00'),
15: Timestamp('2022-06-01 00:00:00')},
'lastRptDt': {2: Timestamp('2022-07-01 00:00:00'),
3: Timestamp('2022-07-01 00:00:00'),
6: Timestamp('2022-03-01 00:00:00'),
7: Timestamp('2022-03-01 00:00:00'),
10: Timestamp('2022-03-01 00:00:00'),
11: Timestamp('2022-03-01 00:00:00'),
14: Timestamp('2022-06-01 00:00:00'),
15: Timestamp('2022-06-01 00:00:00')}}

这个怎么样?

df.set_index(['PrimaryID', 'SecondaryID', 'ReportDate']).loc[:,:,df.groupby(['PrimaryID', 'SecondaryID']).ReportDate.max()]
Out[54]: 
SubAccount    Value  lastRptDt
PrimaryID SecondaryID ReportDate                                
1         A           2022-07-01         123  6722.05 2022-07-01
2022-07-01         456  5500.53 2022-07-01
B           2022-03-01         789  8389.60 2022-03-01
2022-03-01         246   343.02 2022-03-01
2         X           2022-03-01         234  6430.68 2022-03-01
2022-03-01         755  7148.57 2022-03-01
Y           2022-06-01         731  6251.38 2022-06-01
2022-06-01         841  8256.93 2022-06-01

要保留索引,可以使用:


df.reset_index()
.set_index(['PrimaryID', 'SecondaryID', 'ReportDate'], drop=False)
.loc[:,:,df.groupby(['PrimaryID', 'SecondaryID']).ReportDate.max()]
.set_index('index')

Out[49]: 
PrimaryID SecondaryID  ReportDate  SubAccount    Value
index                                                        
2              1           A  2022-07-01         123  6722.05
3              1           A  2022-07-01         456  5500.53
6              1           B  2022-03-01         789  8389.60
7              1           B  2022-03-01         246   343.02
10             2           X  2022-03-01         234  6430.68
11             2           X  2022-03-01         755  7148.57
14             2           Y  2022-06-01         731  6251.38
15             2           Y  2022-06-01         841  8256.93

reset_index()将附加原始索引作为列,您可以在完成后将其重置为索引。注意set_index(.., drop=False)中需要首先使用第二列展示。

相关内容

  • 没有找到相关文章

最新更新