I'm trying to find a way to get the first and last timestamp for each asset on each day in a DataFrame. For example, I have a DataFrame:
import pandas as pd
data = {
'Date':['2022-01-01','2022-01-01','2022-01-01','2022-01-01','2022-01-01','2022-01-01',
'2022-01-01','2022-01-02','2022-01-02','2022-01-02','2022-01-02','2022-01-02','2022-01-02',
'2022-01-02','2022-01-02','2022-01-03','2022-01-03','2022-01-03','2022-01-03','2022-01-03',
'2022-01-03','2022-01-03','2022-01-03'],
'Time':['12:01','12:05','14:07','11:01','13:06','17:12','15:15',
'9:02','8:06','14:06','19:19','10:00','13:01','17:00','10:15',
'8:00','9:00','7:15','16:04','15:02','17:10','12:06','15:00'],
'Asset':[111,111,111,222,222,222,222,
111,111,111,111,111,222,222,222,
333,333,111,111,111,111,333,111]
}
df = pd.DataFrame(data)
df
which looks like:
Date Time Asset
0 2022-01-01 12:01 111
1 2022-01-01 12:05 111
2 2022-01-01 14:07 111
3 2022-01-01 11:01 222
4 2022-01-01 13:06 222
5 2022-01-01 17:12 222
6 2022-01-01 15:15 222
7 2022-01-02 9:02 111
8 2022-01-02 8:06 111
9 2022-01-02 14:06 111
10 2022-01-02 19:19 111
11 2022-01-02 10:00 111
12 2022-01-02 13:01 222
13 2022-01-02 17:00 222
14 2022-01-02 10:15 222
15 2022-01-03 8:00 333
16 2022-01-03 9:00 333
17 2022-01-03 7:15 111
18 2022-01-03 16:04 111
19 2022-01-03 15:02 111
20 2022-01-03 17:10 111
21 2022-01-03 12:06 333
22 2022-01-03 15:00 111
I want to group this data by day and drop all duplicates for each asset within each day, keeping only the first and last timestamp per asset per day. My ideal result looks like this:
data1 = {
'Date':['2022-01-01','2022-01-01','2022-01-01','2022-01-01',
'2022-01-02','2022-01-02','2022-01-02','2022-01-02',
'2022-01-03','2022-01-03','2022-01-03','2022-01-03',],
'Time':['12:01','14:07','11:01','17:12',
'8:06','19:19','10:15','17:00',
'8:00','12:06','7:15','17:10'],
'Asset':[111,111,222,222,
111,111,222,222,
333,333,111,111]
}
df1 = pd.DataFrame(data1)
df1
which looks like:
Date Time Asset
0 2022-01-01 12:01 111
1 2022-01-01 14:07 111
2 2022-01-01 11:01 222
3 2022-01-01 17:12 222
4 2022-01-02 8:06 111
5 2022-01-02 19:19 111
6 2022-01-02 10:15 222
7 2022-01-02 17:00 222
8 2022-01-03 8:00 333
9 2022-01-03 12:06 333
10 2022-01-03 7:15 111
11 2022-01-03 17:10 111
Ideally I'd like to solve this in Python, but if there is an easier solution in R or SQL, I can use those too. Any help would be greatly appreciated! Thanks in advance!
import pandas as pd
data = {
'Date':['2022-01-01','2022-01-01','2022-01-01','2022-01-01','2022-01-01','2022-01-01',
'2022-01-01','2022-01-02','2022-01-02','2022-01-02','2022-01-02','2022-01-02','2022-01-02',
'2022-01-02','2022-01-02','2022-01-03','2022-01-03','2022-01-03','2022-01-03','2022-01-03',
'2022-01-03','2022-01-03','2022-01-03'],
'Time':['12:01','12:05','14:07','11:01','13:06','17:12','15:15',
'9:02','8:06','14:06','19:19','10:00','13:01','17:00','10:15',
'8:00','9:00','7:15','16:04','15:02','17:10','12:06','15:00'],
'Asset':[111,111,111,222,222,222,222,
111,111,111,111,111,222,222,222,
333,333,111,111,111,111,333,111]
}
df = pd.DataFrame(data)
# First and last row of each (Date, Asset) group, in the rows' original order
df_f = df.groupby(by=['Date', 'Asset']).first().reset_index()
df_l = df.groupby(by=['Date', 'Asset']).last().reset_index()
df_fl = pd.concat([df_f, df_l])[['Date', 'Time', 'Asset']]
df_fl = df_fl.sort_values(by=['Date', 'Asset', 'Time']).reset_index(drop=True)
print(df_fl)
prints

          Date   Time  Asset
0   2022-01-01  12:01    111
1   2022-01-01  14:07    111
2   2022-01-01  11:01    222
3   2022-01-01  15:15    222
4   2022-01-02  10:00    111
5   2022-01-02   9:02    111
6   2022-01-02  10:15    222
7   2022-01-02  13:01    222
8   2022-01-03  15:00    111
9   2022-01-03   7:15    111
10  2022-01-03  12:06    333
11  2022-01-03   8:00    333
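Note that `.first()` and `.last()` pick the first and last rows in each group's *original order*, not the chronologically earliest and latest, and the unpadded time strings ('9:02' sorts after '10:00') distort the final sort. A sketch of a variant that parses and zero-pads the times before sorting, using a small sample with the same columns for illustration:

```python
import pandas as pd

# Small sample for illustration; same column layout as the question's df.
df = pd.DataFrame({
    'Date':  ['2022-01-02'] * 3,
    'Time':  ['9:02', '19:19', '8:06'],
    'Asset': [111, 111, 111],
})

# Parse the times so '8:06' sorts before '19:19', then order chronologically.
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M').dt.strftime('%H:%M')
df = df.sort_values(['Date', 'Asset', 'Time'])

# Now first()/last() really are the earliest and latest rows per day and asset.
df_f = df.groupby(['Date', 'Asset'], as_index=False).first()
df_l = df.groupby(['Date', 'Asset'], as_index=False).last()
out = pd.concat([df_f, df_l]).sort_values(['Date', 'Asset', 'Time'])
print(out)
```

With the rows sorted first, this keeps 08:06 and 19:19 for the sample group, matching the min/max pairs in the desired output.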
Here is one way to achieve it: use groupby to get the max and min time for each date and asset, then stack and reset the index.
# Convert the time strings to two-digit hours and minutes, to get the right min/max
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M').dt.strftime('%H:%M')
(df.groupby(['Date', 'Asset'])['Time']
   .agg(['max', 'min'])
   .stack()
   .reset_index()
   .rename(columns={0: 'Time'})
   .drop(columns='level_2')
   .sort_values(['Date', 'Asset', 'Time']))
Date Asset Time
1 2022-01-01 111 12:01
0 2022-01-01 111 14:07
3 2022-01-01 222 11:01
2 2022-01-01 222 17:12
5 2022-01-02 111 08:06
4 2022-01-02 111 19:19
7 2022-01-02 222 10:15
6 2022-01-02 222 17:00
9 2022-01-03 111 07:15
8 2022-01-03 111 17:10
11 2022-01-03 333 08:00
10 2022-01-03 333 12:06