如何通过对分组行执行OHLC操作来聚合DataFrame中一天下午4点之后到第二天上午10点之前的所有行?
这将从转换原始DataFrame
symbol datetime open high low close date toCombine
0 AAPL 2020-01-01 15:00 3 5 2 4 2020-01-01 False
1 AAPL 2020-01-01 15:30 4 10 4 8 2020-01-01 False
2 AAPL 2020-01-01 16:00 8 15 6 12 2020-01-01 False
3 AAPL 2020-01-01 18:00 12 20 8 16 2020-01-01 True
4 AAPL 2020-01-01 20:00 12 20 8 16 2020-01-01 True
5 AAPL 2020-01-02 07:00 15 24 9 19 2020-01-02 True
6 AAPL 2020-01-02 10:00 16 25 10 20 2020-01-02 True
7 AAPL 2020-01-02 12:00 20 30 12 24 2020-01-02 False
8 AAPL 2020-01-02 14:00 24 70 14 26 2020-01-02 False
9 AAPL 2020-01-02 16:00 103 105 102 104 2020-01-02 False
10 AAPL 2020-01-02 18:00 104 100 104 196 2020-01-02 True
11 AAPL 2020-01-03 08:00 108 110 106 112 2020-01-03 True
12 AAPL 2020-01-03 10:30 112 120 108 116 2020-01-03 False
13 AAPL 2020-01-03 13:00 115 124 109 119 2020-01-03 False
到(忽略此处的索引值(:
symbol datetime open high low close date
0 AAPL 2020-01-01 15:00 3 5 2 4 2020-01-01
1 AAPL 2020-01-01 15:30 4 10 4 8 2020-01-01
2 AAPL 2020-01-01 16:00 8 15 6 12 2020-01-01
6 AAPL 2020-01-02 10:00 12 25 8 20 2020-01-02 <---- aggregated row
7 AAPL 2020-01-02 12:00 20 30 12 24 2020-01-02
8 AAPL 2020-01-02 14:00 24 70 14 26 2020-01-02
9 AAPL 2020-01-02 16:00 103 105 102 104 2020-01-02
11 AAPL 2020-01-03 10:00 103 110 100 112 2020-01-03 <---- aggregated 10:00 row created if not exist
12 AAPL 2020-01-03 10:30 112 120 108 116 2020-01-03
13 AAPL 2020-01-03 13:00 115 124 109 119 2020-01-03
注:
已经创建了
toCombine
列,用于标记将聚合为具有10:00
时间的datetime
值的单行。如果具有
datetime
值且时间为10:00
的行不存在,则应创建该行。但是,如果也没有toCombine == True
的行可从中聚合,则不需要创建10:00
行。
谢谢!
Python代码设置问题
import pandas as pd
data = [
('AAPL', '2020-01-01 15:00', 3, 5, 2, 4, '2020-01-01', False),
('AAPL', '2020-01-01 15:30', 4, 10, 4, 8, '2020-01-01', False),
('AAPL', '2020-01-01 16:00', 8, 15, 6, 12, '2020-01-01', False),
('AAPL', '2020-01-01 18:00', 12, 20, 8, 16, '2020-01-01', True),
('AAPL', '2020-01-01 20:00', 12, 20, 8, 16, '2020-01-01', True),
('AAPL', '2020-01-02 07:00', 15, 24, 9, 19, '2020-01-02', True),
('AAPL', '2020-01-02 10:00', 16, 25, 10, 20, '2020-01-02', True),
('AAPL', '2020-01-02 12:00', 20, 30, 12, 24, '2020-01-02', False),
('AAPL', '2020-01-02 14:00', 24, 70, 14, 26, '2020-01-02', False),
('AAPL', '2020-01-02 16:00', 103, 105, 102, 104, '2020-01-02', False),
('AAPL', '2020-01-02 18:00', 104, 100, 104, 196, '2020-01-02', True),
('AAPL', '2020-01-03 08:00', 108, 110, 106, 112, '2020-01-03', True),
('AAPL', '2020-01-03 10:30', 112, 120, 108, 116, '2020-01-03', False),
('AAPL', '2020-01-03 13:00', 115, 124, 109, 119, '2020-01-03', False),
]
df = pd.DataFrame(data, columns=['symbol', 'datetime', 'open', 'high', 'low', 'close', 'date', 'toCombine'])
print(df)
我的方法
#Group and agg
m = df['toCombine']
agg_dict = {'datetime' : 'last',
'open' : 'first',
'high' : 'max',
'low' : 'min',
'close' : 'last'}
reduce_df = (df.loc[m].groupby(['symbol',(~m).cumsum()],
as_index=False).agg(agg_dict))
print(reduce_df)
symbol datetime open high low close
0 AAPL 2020-01-02 10:00 12 25 8 20
1 AAPL 2020-01-03 08:00 104 110 104 112
如果排序正确,按符号分组可能不必要
#get correct datetime, append and sort
datetime = pd.to_datetime(reduce_df['datetime'])
hours = datetime.dt.hour
df = (reduce_df.assign(datetime = (datetime.mask(hours.gt(16),
datetime.add(pd.to_timedelta(10 + 24 - hours,
unit='h')))
.mask(hours.lt(10),
datetime.add(pd.to_timedelta(10 - hours,
unit='h')))),
date = lambda x: x['datetime'].dt.date)
.assign(datetime = lambda x: x['datetime'].dt.strftime('%Y-%m-%d %H:%M'))
.append(df.loc[~m])
.drop(columns='toCombine')
.sort_values(['symbol','datetime','date'])
)
print(df)
symbol datetime open high low close date
0 AAPL 2020-01-01 15:00 3 5 2 4 2020-01-01
1 AAPL 2020-01-01 15:30 4 10 4 8 2020-01-01
2 AAPL 2020-01-01 16:00 8 15 6 12 2020-01-01
0 AAPL 2020-01-02 10:00 12 25 8 20 2020-01-02
7 AAPL 2020-01-02 12:00 20 30 12 24 2020-01-02
8 AAPL 2020-01-02 14:00 24 70 14 26 2020-01-02
9 AAPL 2020-01-02 16:00 103 105 102 104 2020-01-02
1 AAPL 2020-01-03 10:00 104 110 104 112 2020-01-03
12 AAPL 2020-01-03 10:30 112 120 108 116 2020-01-03
我们以前可以使用dataframe.set_index
、dataframe.asfreq
和ffill
进行分组,这样以后就不必更改日期时间,但我认为性能将与相似
如果要包含小时=16,则需要更改datetime.mask(hours.
ge(16).......
当我们使用.loc
时,包括它,因为它不包括在toCombine
中(第9行错误(