Pandas: computing daily statistics when reading in chunks

Consider a Postgres table that, for the date 2022-05-01, holds 200 values recorded at different times:

time                        value                                                                                
2022-05-01 00:17:20+00:00  17175 
2022-05-01 13:33:56+00:00  18000
...

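I need to read the data chunk by chunk with chunk_size=50. Computing the daily statistics by resampling and aggregating each chunk yields four results with the same index, each holding only part of the aggregated value.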

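import pandas as pd

with engine.connect().execution_options(stream_results=True) as conn:
    # Pass the streaming connection (not the engine) so rows are fetched chunk by chunk
    for chunk_df in pd.read_sql(query, conn, chunksize=50):
        chunk_df.index = pd.to_datetime(chunk_df.time, utc=True)
        chunk_df.sort_index(inplace=True)
        # One row per day, but covering only the rows of this chunk
        result_df = chunk_df[['value']].resample('1D').sum()

time                        value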
2022-05-01 00:00:00+00:00  52175 

time                        value                                                                                
2022-05-01 00:00:00+00:00  12001 

time                        value                                                                                
2022-05-01 00:00:00+00:00  3506 

time                        value                                                                                
2022-05-01 00:00:00+00:00  45623

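I would like to know whether there is a way to compute the correct aggregated value directly. In other words, how can the chunk size be set according to the time interval used for resampling? The desired result is: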

time                        value                                                                                
2022-05-01 00:00:00+00:00  113305

If I have understood correctly what you are after, a query like this should do it:

select date_trunc('day', time), sum(value) from table_name group by 1;

You can also add

order by 1 asc/desc to sort the result
where date_trunc('day', time) = '2020-03-16 00:00:00' to filter by date
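
If the aggregation has to stay in pandas, one option is to keep the per-chunk daily sums and add them up by day at the end, instead of trying to align the chunk size with the resampling interval. The following is only a minimal sketch: the helper name daily_sum_from_chunks is made up for illustration, and the synthetic DataFrame stands in for the chunks that pd.read_sql(query, conn, chunksize=50) would yield.

```python
import pandas as pd


def daily_sum_from_chunks(chunks):
    """Combine per-chunk daily sums into one total per day (hypothetical helper)."""
    partials = []
    for chunk_df in chunks:
        chunk_df = chunk_df.copy()
        chunk_df.index = pd.to_datetime(chunk_df["time"], utc=True)
        # Partial sums: one row per day covered by this chunk
        partials.append(chunk_df["value"].resample("1D").sum())
    # The same day can appear in several partials; add them up by index
    return pd.concat(partials).groupby(level=0).sum()


# Assumed stand-in for the database: 200 rows on 2022-05-01, read 50 at a time
times = pd.date_range("2022-05-01 00:00", periods=200, freq="5min", tz="UTC")
df = pd.DataFrame({"time": times, "value": range(200)})
chunks = (df.iloc[i:i + 50] for i in range(0, len(df), 50))

daily = daily_sum_from_chunks(chunks)
print(daily)
```

This works because a sum can be built from partial sums. A statistic such as the mean cannot be combined this way directly; there you would carry the per-chunk sum and count and divide at the end.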
