如何在熊猫中动态添加时间桶



>我在熊猫中有以下数据帧

code    start time      end time     quantity   time_diff(in mins) lpm
123     12:37:00        13:35:00     6000       58                 103.44
124     15:37:00        15:53:00     1000       16                 62.5
time_diff = end_time - start_time
lpm = quantity / time_diff

现在,我想将此数量分成half_hourly桶,如下所示

code   half_hourly_bucket     quantity
123    12:30:01-13:00:00      2379.35
123    13:00:01-13:30:00      3103.50
123    13:30:01-14:00:00      517.25
124    15:30:01-16:00:00      1000

例如,如果我们考虑第一次观察,它的start_timeend_time分为 3 个半小时的桶,因此从12:37:00 to 13:00:00开始,它已经分配了 2379.35 升燃料,就像其他桶一样。如果start_time and end_time落入同一个半小时桶,我们不需要做任何事情。

如何在熊猫中实现这一点?我是蟒蛇的新手。

我不确定是否有一种巧妙的方式来处理输入表而无需从头开始创建一个新的输入表,所以这有点粗糙,但工作得很好:

import pandas as pd
from datetime import datetime, time, timedelta, date
import random
# --- make demo table ---
random.seed( 0 )
def makeRandomTable():
data = []
hour = 12
code = 100
for i in range(10):
row = { 'code': code }
code += 1
if random.random() < 0.18:
hour += 1
minute = random.randint(0,59)
row[ 'start_time' ] = datetime.combine( date.today(), time( hour=hour, minute=minute ) )
row[ 'end_time' ] = row[ 'start_time' ] + timedelta( minutes=random.randint( 0, 80 ) )
row[ 'quantity' ] = random.randint(1000,9000)
data.append( row )
df = pd.DataFrame( data )
df[ 'time_diff' ] = df[ 'end_time' ] - df[ 'start_time' ]
df[ 'time_diff' ] = df[ 'time_diff' ].apply( lambda x: x.total_seconds() / 60.0 )
return df[ ['code', 'start_time', 'end_time', 'time_diff', 'quantity' ] ]

df = makeRandomTable()
print( df )

所以我的输入数据看起来像这样:

code          start_time            end_time  time_diff  quantity
0   100 2018-09-30 12:48:00 2018-09-30 13:41:00       53.0      1331
1   101 2018-09-30 12:32:00 2018-09-30 13:34:00       62.0      4317
2   102 2018-09-30 12:53:00 2018-09-30 13:31:00       38.0      8928
3   103 2018-09-30 12:37:00 2018-09-30 13:04:00       27.0      5134
4   104 2018-09-30 13:08:00 2018-09-30 13:20:00       12.0      6065
5   105 2018-09-30 13:58:00 2018-09-30 15:06:00       68.0      6776
6   106 2018-09-30 13:57:00 2018-09-30 14:15:00       18.0      3540
7   107 2018-09-30 14:04:00 2018-09-30 14:46:00       42.0      4867
8   108 2018-09-30 14:22:00 2018-09-30 15:17:00       55.0      3590
9   109 2018-09-30 14:58:00 2018-09-30 15:24:00       26.0      8918

您首先需要获得整个时间段的 30 分钟四舍五入的开始和结束时间:

# determine period start timestamp
min_start = df[ 'start_time' ].min()
if min_start.round('30min') != min_start:
ts_start = (min_start - timedelta(minutes=15)).round('30min')
else:
ts_start = min_start
# determine period end timestamp
max_end = df[ 'end_time' ].max()
if max_end.round('30min') != max_end:
ts_end = (max_end + timedelta(minutes=15)).round('30min')
else:
ts_end = max_end

现在对输入数据进行重新采样:

baseindex = pd.DatetimeIndex( freq='30min', start=ts_start, end=ts_end )
rows = []
for start, end in zip( baseindex, baseindex[1:] ):
# filter non-overlapping periods
exclude_mask = (df[ 'end_time' ] < start) | (df[ 'start_time' ] > end)
for code, dfg in df[ ~exclude_mask ].groupby( 'code' ):
overlap = min( dfg.end_time.iat[0], end ) - max( dfg.start_time.iat[0], start )
rows.append( { 'start': start,
'code': code,
'overlap': overlap.total_seconds() / 60.0, # in mins
'orig_quantity': dfg.quantity.iat[0],
'orig_start_time': dfg.start_time.iat[0],
'orig_end_time': dfg.end_time.iat[0],
'orig_time_diff' : dfg.time_diff.iat[0],
} )
col_order = ['orig_quantity','orig_start_time','orig_end_time','orig_time_diff','overlap']
df2 = pd.DataFrame( rows ).set_index( ['start','code'] ).sort_index()[ col_order ]
df2[ 'frac_overlap' ] = df2[ 'overlap' ] / df2[ 'orig_time_diff' ]
df2[ 'quantity' ] = df2[ 'orig_quantity' ] * df2[ 'frac_overlap' ]
df2[ 'lpm' ] = df2[ 'quantity' ] / df2[ 'overlap' ]
print( df2 )

所以df2现在就是这样...

orig_quantity     orig_start_time       orig_end_time  orig_time_diff  overlap  frac_overlap     quantity         lpm
start               code
2018-09-30 12:30:00 100            1331 2018-09-30 12:48:00 2018-09-30 13:41:00            53.0     12.0      0.226415   301.358491   25.113208
101            4317 2018-09-30 12:32:00 2018-09-30 13:34:00            62.0     28.0      0.451613  1949.612903   69.629032
102            8928 2018-09-30 12:53:00 2018-09-30 13:31:00            38.0      7.0      0.184211  1644.631579  234.947368
103            5134 2018-09-30 12:37:00 2018-09-30 13:04:00            27.0     23.0      0.851852  4373.407407  190.148148
2018-09-30 13:00:00 100            1331 2018-09-30 12:48:00 2018-09-30 13:41:00            53.0     30.0      0.566038   753.396226   25.113208
101            4317 2018-09-30 12:32:00 2018-09-30 13:34:00            62.0     30.0      0.483871  2088.870968   69.629032
102            8928 2018-09-30 12:53:00 2018-09-30 13:31:00            38.0     30.0      0.789474  7048.421053  234.947368
103            5134 2018-09-30 12:37:00 2018-09-30 13:04:00            27.0      4.0      0.148148   760.592593  190.148148
104            6065 2018-09-30 13:08:00 2018-09-30 13:20:00            12.0     12.0      1.000000  6065.000000  505.416667
2018-09-30 13:30:00 100            1331 2018-09-30 12:48:00 2018-09-30 13:41:00            53.0     11.0      0.207547   276.245283   25.113208
101            4317 2018-09-30 12:32:00 2018-09-30 13:34:00            62.0      4.0      0.064516   278.516129   69.629032
102            8928 2018-09-30 12:53:00 2018-09-30 13:31:00            38.0      1.0      0.026316   234.947368  234.947368
105            6776 2018-09-30 13:58:00 2018-09-30 15:06:00            68.0      2.0      0.029412   199.294118   99.647059
106            3540 2018-09-30 13:57:00 2018-09-30 14:15:00            18.0      3.0      0.166667   590.000000  196.666667
2018-09-30 14:00:00 105            6776 2018-09-30 13:58:00 2018-09-30 15:06:00            68.0     30.0      0.441176  2989.411765   99.647059
106            3540 2018-09-30 13:57:00 2018-09-30 14:15:00            18.0     15.0      0.833333  2950.000000  196.666667
107            4867 2018-09-30 14:04:00 2018-09-30 14:46:00            42.0     26.0      0.619048  3012.904762  115.880952
108            3590 2018-09-30 14:22:00 2018-09-30 15:17:00            55.0      8.0      0.145455   522.181818   65.272727
2018-09-30 14:30:00 105            6776 2018-09-30 13:58:00 2018-09-30 15:06:00            68.0     30.0      0.441176  2989.411765   99.647059
107            4867 2018-09-30 14:04:00 2018-09-30 14:46:00            42.0     16.0      0.380952  1854.095238  115.880952
108            3590 2018-09-30 14:22:00 2018-09-30 15:17:00            55.0     30.0      0.545455  1958.181818   65.272727
109            8918 2018-09-30 14:58:00 2018-09-30 15:24:00            26.0      2.0      0.076923   686.000000  343.000000
2018-09-30 15:00:00 105            6776 2018-09-30 13:58:00 2018-09-30 15:06:00            68.0      6.0      0.088235   597.882353   99.647059
108            3590 2018-09-30 14:22:00 2018-09-30 15:17:00            55.0     17.0      0.309091  1109.636364   65.272727
109            8918 2018-09-30 14:58:00 2018-09-30 15:24:00            26.0     24.0      0.923077  8232.000000  343.000000

哪个看起来差不多

因此,让我们回顾一下重要的一点:

for start, end in zip( baseindex, baseindex[1:] ):
# filter non-overlapping periods
exclude_mask = (df[ 'end_time' ] < start) | (df[ 'start_time' ] > end)
for code, dfg in df[ ~exclude_mask ].groupby( 'code' ):
overlap = min( dfg.end_time.iat[0], end ) - max( dfg.start_time.iat[0], start )

startend取值 (2018-09-30 12:30:00, 2018-09-30 13:00:00(...(2018-09-30 13:00:00, 2018-09-30 13:30:00)...等等,以便我们可以在这些时间之间过滤df

exclude_mask对于start_timeend_time与当前(startend(完全没有重叠的每一行,True~exclude_mask翻转True/False

for code, dfg in df[ ~exclude_mask ].groupby( 'code' )遍历按code分组的具有重叠时间的行(即单个行(,并生成code,这只是代码值,dfg是表示组的数据帧(在这种情况下,它始终有 1 行(

overlap = min( dfg.end_time.iat[0], end ) - max( dfg.start_time.iat[0], start )只是当前(startend(期间的重叠。

希望这是有道理的


编辑

我们可以将数据帧整理为所需的输出:

>>> df3 = df2.reset_index()
>>> df3[ 'half_hourly_bucket' ] = df3['start'].apply( lambda x: str(x.time()) + '-' + str((x+timedelta(minutes=30)).time()) )
>>>
>>> df3[ ['code','half_hourly_bucket','quantity'] ].set_index('code').sort_index()
half_hourly_bucket     quantity
code
100   12:30:00-13:00:00   301.358491
100   13:00:00-13:30:00   753.396226
100   13:30:00-14:00:00   276.245283
101   12:30:00-13:00:00  1949.612903
101   13:00:00-13:30:00  2088.870968
101   13:30:00-14:00:00   278.516129
102   12:30:00-13:00:00  1644.631579
102   13:00:00-13:30:00  7048.421053
102   13:30:00-14:00:00   234.947368
103   12:30:00-13:00:00  4373.407407
103   13:00:00-13:30:00   760.592593
104   13:00:00-13:30:00  6065.000000
105   15:00:00-15:30:00   597.882353
105   14:30:00-15:00:00  2989.411765
105   13:30:00-14:00:00   199.294118
105   14:00:00-14:30:00  2989.411765
106   14:00:00-14:30:00  2950.000000
106   13:30:00-14:00:00   590.000000
107   14:00:00-14:30:00  3012.904762
107   14:30:00-15:00:00  1854.095238
108   14:00:00-14:30:00   522.181818
108   14:30:00-15:00:00  1958.181818
108   15:00:00-15:30:00  1109.636364
109   14:30:00-15:00:00   686.000000
109   15:00:00-15:30:00  8232.000000
>>>

最新更新