使用Python Pandas DataFrame的选择性Groupby Aggregate



如何通过对分组行执行OHLC操作来聚合DataFrame中一天下午4点之后到第二天上午10点之前的所有行?

这将从转换原始DataFrame

symbol          datetime  open  high  low  close   date  toCombine
0    AAPL  2020-01-01 15:00     3     5    2      4  2020-01-01      False
1    AAPL  2020-01-01 15:30     4    10    4      8  2020-01-01      False
2    AAPL  2020-01-01 16:00     8    15    6     12  2020-01-01      False
3    AAPL  2020-01-01 18:00    12    20    8     16  2020-01-01       True
4    AAPL  2020-01-01 20:00    12    20    8     16  2020-01-01       True
5    AAPL  2020-01-02 07:00    15    24    9     19  2020-01-02       True
6    AAPL  2020-01-02 10:00    16    25   10     20  2020-01-02       True
7    AAPL  2020-01-02 12:00    20    30   12     24  2020-01-02      False
8    AAPL  2020-01-02 14:00    24    70   14     26  2020-01-02      False
9    AAPL  2020-01-02 16:00   103   105  102    104  2020-01-02      False
10   AAPL  2020-01-02 18:00   104   100  104    196  2020-01-02       True
11   AAPL  2020-01-03 08:00   108   110  106    112  2020-01-03       True
12   AAPL  2020-01-03 10:30   112   120  108    116  2020-01-03      False
13   AAPL  2020-01-03 13:00   115   124  109    119  2020-01-03      False

(忽略此处的索引值(

symbol          datetime  open  high  low  close   date  
0    AAPL  2020-01-01 15:00     3     5    2      4  2020-01-01      
1    AAPL  2020-01-01 15:30     4    10    4      8  2020-01-01      
2    AAPL  2020-01-01 16:00     8    15    6     12  2020-01-01      
6    AAPL  2020-01-02 10:00    12    25    8     20  2020-01-02  <---- aggregated row    
7    AAPL  2020-01-02 12:00    20    30   12     24  2020-01-02      
8    AAPL  2020-01-02 14:00    24    70   14     26  2020-01-02
9    AAPL  2020-01-02 16:00   103   105  102    104  2020-01-02 
11   AAPL  2020-01-03 10:00   103   110  100    112  2020-01-03  <---- aggregated 10:00 row created if not exist
12   AAPL  2020-01-03 10:30   112   120  108    116  2020-01-03
13   AAPL  2020-01-03 13:00   115   124  109    119  2020-01-03     

注:

  1. 已经创建了toCombine列,用于标记将聚合为具有10:00时间的datetime值的单行。

  2. 如果具有datetime值且时间为10:00的行不存在,则应创建该行。但是,如果也没有toCombine == True的行可从中聚合,则不需要创建10:00行。

谢谢!


Python代码设置问题

import pandas as pd
data = [
('AAPL', '2020-01-01 15:00', 3, 5, 2, 4, '2020-01-01', False),
('AAPL', '2020-01-01 15:30', 4, 10, 4, 8, '2020-01-01', False),
('AAPL', '2020-01-01 16:00', 8, 15, 6, 12, '2020-01-01', False),
('AAPL', '2020-01-01 18:00', 12, 20, 8, 16, '2020-01-01', True),        
('AAPL', '2020-01-01 20:00', 12, 20, 8, 16, '2020-01-01', True),        
('AAPL', '2020-01-02 07:00', 15, 24, 9, 19, '2020-01-02', True),       
('AAPL', '2020-01-02 10:00', 16, 25, 10, 20, '2020-01-02', True),
('AAPL', '2020-01-02 12:00', 20, 30, 12, 24, '2020-01-02', False),
('AAPL', '2020-01-02 14:00', 24, 70, 14, 26, '2020-01-02', False),
('AAPL', '2020-01-02 16:00', 103, 105, 102, 104, '2020-01-02', False),
('AAPL', '2020-01-02 18:00', 104, 100, 104, 196, '2020-01-02', True),     
('AAPL', '2020-01-03 08:00', 108, 110, 106, 112, '2020-01-03', True),    
('AAPL', '2020-01-03 10:30', 112, 120, 108, 116, '2020-01-03', False),     
('AAPL', '2020-01-03 13:00', 115, 124, 109, 119, '2020-01-03', False),     
]
df = pd.DataFrame(data, columns=['symbol', 'datetime', 'open', 'high', 'low', 'close', 'date', 'toCombine'])
print(df)

我的方法

#Group and agg
m = df['toCombine']
agg_dict = {'datetime' : 'last',
'open' : 'first',
'high' : 'max',
'low' : 'min',
'close' : 'last'}
reduce_df = (df.loc[m].groupby(['symbol',(~m).cumsum()],
as_index=False).agg(agg_dict))
print(reduce_df)
symbol          datetime  open  high  low  close
0   AAPL  2020-01-02 10:00    12    25    8     20
1   AAPL  2020-01-03 08:00   104   110  104    112

如果排序正确,按符号分组可能不必要


#get correct datetime, append and sort
datetime = pd.to_datetime(reduce_df['datetime'])
hours = datetime.dt.hour
df = (reduce_df.assign(datetime = (datetime.mask(hours.gt(16),
datetime.add(pd.to_timedelta(10 + 24 - hours,
unit='h')))
.mask(hours.lt(10),
datetime.add(pd.to_timedelta(10 - hours,
unit='h')))),            
date = lambda x: x['datetime'].dt.date)
.assign(datetime = lambda x: x['datetime'].dt.strftime('%Y-%m-%d %H:%M'))
.append(df.loc[~m])
.drop(columns='toCombine')
.sort_values(['symbol','datetime','date'])
)
print(df)

symbol          datetime  open  high  low  close        date
0    AAPL  2020-01-01 15:00     3     5    2      4  2020-01-01
1    AAPL  2020-01-01 15:30     4    10    4      8  2020-01-01
2    AAPL  2020-01-01 16:00     8    15    6     12  2020-01-01
0    AAPL  2020-01-02 10:00    12    25    8     20  2020-01-02
7    AAPL  2020-01-02 12:00    20    30   12     24  2020-01-02
8    AAPL  2020-01-02 14:00    24    70   14     26  2020-01-02
9    AAPL  2020-01-02 16:00   103   105  102    104  2020-01-02
1    AAPL  2020-01-03 10:00   104   110  104    112  2020-01-03
12   AAPL  2020-01-03 10:30   112   120  108    116  2020-01-03

我们以前可以使用dataframe.set_indexdataframe.asfreqffill进行分组,这样以后就不必更改日期时间,但我认为性能将与相似

如果要包含小时=16,则需要更改datetime.mask(hours.ge(16).......

当我们使用.loc时,包括它,因为它不包括在toCombine中(第9行错误(

最新更新