>我在熊猫中有以下数据帧
code start time end time quantity time_diff(in mins) lpm
123 12:37:00 13:35:00 6000 58 103.44
124 15:37:00 15:53:00 1000 16 62.5
time_diff = end_time - start_time
lpm = quantity / time_diff
现在,我想将此数量分成half_hourly桶,如下所示
code half_hourly_bucket quantity
123 12:30:01-13:00:00 2379.35
123 13:00:01-13:30:00 3103.50
123 13:30:01-14:00:00 517.25
124 15:30:01-16:00:00 1000
例如,如果我们考虑第一次观察,它的start_time
和end_time
分为 3 个半小时的桶,因此从12:37:00 to 13:00:00
开始,它已经分配了 2379.35 升燃料,就像其他桶一样。如果start_time and end_time
落入同一个半小时桶,我们不需要做任何事情。
如何在熊猫中实现这一点?我是蟒蛇的新手。
我不确定是否有一种巧妙的方式来处理输入表而无需从头开始创建一个新的输入表,所以这有点粗糙,但工作得很好:
import pandas as pd
from datetime import datetime, time, timedelta, date
import random
# --- make demo table ---
random.seed( 0 )
def makeRandomTable():
data = []
hour = 12
code = 100
for i in range(10):
row = { 'code': code }
code += 1
if random.random() < 0.18:
hour += 1
minute = random.randint(0,59)
row[ 'start_time' ] = datetime.combine( date.today(), time( hour=hour, minute=minute ) )
row[ 'end_time' ] = row[ 'start_time' ] + timedelta( minutes=random.randint( 0, 80 ) )
row[ 'quantity' ] = random.randint(1000,9000)
data.append( row )
df = pd.DataFrame( data )
df[ 'time_diff' ] = df[ 'end_time' ] - df[ 'start_time' ]
df[ 'time_diff' ] = df[ 'time_diff' ].apply( lambda x: x.total_seconds() / 60.0 )
return df[ ['code', 'start_time', 'end_time', 'time_diff', 'quantity' ] ]
df = makeRandomTable()
print( df )
所以我的输入数据看起来像这样:
code start_time end_time time_diff quantity
0 100 2018-09-30 12:48:00 2018-09-30 13:41:00 53.0 1331
1 101 2018-09-30 12:32:00 2018-09-30 13:34:00 62.0 4317
2 102 2018-09-30 12:53:00 2018-09-30 13:31:00 38.0 8928
3 103 2018-09-30 12:37:00 2018-09-30 13:04:00 27.0 5134
4 104 2018-09-30 13:08:00 2018-09-30 13:20:00 12.0 6065
5 105 2018-09-30 13:58:00 2018-09-30 15:06:00 68.0 6776
6 106 2018-09-30 13:57:00 2018-09-30 14:15:00 18.0 3540
7 107 2018-09-30 14:04:00 2018-09-30 14:46:00 42.0 4867
8 108 2018-09-30 14:22:00 2018-09-30 15:17:00 55.0 3590
9 109 2018-09-30 14:58:00 2018-09-30 15:24:00 26.0 8918
您首先需要获得整个时间段的 30 分钟四舍五入的开始和结束时间:
# determine period start timestamp
min_start = df[ 'start_time' ].min()
if min_start.round('30min') != min_start:
ts_start = (min_start - timedelta(minutes=15)).round('30min')
else:
ts_start = min_start
# determine period end timestamp
max_end = df[ 'end_time' ].max()
if max_end.round('30min') != max_end:
ts_end = (max_end + timedelta(minutes=15)).round('30min')
else:
ts_end = max_end
现在对输入数据进行重新采样:
baseindex = pd.DatetimeIndex( freq='30min', start=ts_start, end=ts_end )
rows = []
for start, end in zip( baseindex, baseindex[1:] ):
# filter non-overlapping periods
exclude_mask = (df[ 'end_time' ] < start) | (df[ 'start_time' ] > end)
for code, dfg in df[ ~exclude_mask ].groupby( 'code' ):
overlap = min( dfg.end_time.iat[0], end ) - max( dfg.start_time.iat[0], start )
rows.append( { 'start': start,
'code': code,
'overlap': overlap.total_seconds() / 60.0, # in mins
'orig_quantity': dfg.quantity.iat[0],
'orig_start_time': dfg.start_time.iat[0],
'orig_end_time': dfg.end_time.iat[0],
'orig_time_diff' : dfg.time_diff.iat[0],
} )
col_order = ['orig_quantity','orig_start_time','orig_end_time','orig_time_diff','overlap']
df2 = pd.DataFrame( rows ).set_index( ['start','code'] ).sort_index()[ col_order ]
df2[ 'frac_overlap' ] = df2[ 'overlap' ] / df2[ 'orig_time_diff' ]
df2[ 'quantity' ] = df2[ 'orig_quantity' ] * df2[ 'frac_overlap' ]
df2[ 'lpm' ] = df2[ 'quantity' ] / df2[ 'overlap' ]
print( df2 )
所以df2
现在就是这样...
orig_quantity orig_start_time orig_end_time orig_time_diff overlap frac_overlap quantity lpm
start code
2018-09-30 12:30:00 100 1331 2018-09-30 12:48:00 2018-09-30 13:41:00 53.0 12.0 0.226415 301.358491 25.113208
101 4317 2018-09-30 12:32:00 2018-09-30 13:34:00 62.0 28.0 0.451613 1949.612903 69.629032
102 8928 2018-09-30 12:53:00 2018-09-30 13:31:00 38.0 7.0 0.184211 1644.631579 234.947368
103 5134 2018-09-30 12:37:00 2018-09-30 13:04:00 27.0 23.0 0.851852 4373.407407 190.148148
2018-09-30 13:00:00 100 1331 2018-09-30 12:48:00 2018-09-30 13:41:00 53.0 30.0 0.566038 753.396226 25.113208
101 4317 2018-09-30 12:32:00 2018-09-30 13:34:00 62.0 30.0 0.483871 2088.870968 69.629032
102 8928 2018-09-30 12:53:00 2018-09-30 13:31:00 38.0 30.0 0.789474 7048.421053 234.947368
103 5134 2018-09-30 12:37:00 2018-09-30 13:04:00 27.0 4.0 0.148148 760.592593 190.148148
104 6065 2018-09-30 13:08:00 2018-09-30 13:20:00 12.0 12.0 1.000000 6065.000000 505.416667
2018-09-30 13:30:00 100 1331 2018-09-30 12:48:00 2018-09-30 13:41:00 53.0 11.0 0.207547 276.245283 25.113208
101 4317 2018-09-30 12:32:00 2018-09-30 13:34:00 62.0 4.0 0.064516 278.516129 69.629032
102 8928 2018-09-30 12:53:00 2018-09-30 13:31:00 38.0 1.0 0.026316 234.947368 234.947368
105 6776 2018-09-30 13:58:00 2018-09-30 15:06:00 68.0 2.0 0.029412 199.294118 99.647059
106 3540 2018-09-30 13:57:00 2018-09-30 14:15:00 18.0 3.0 0.166667 590.000000 196.666667
2018-09-30 14:00:00 105 6776 2018-09-30 13:58:00 2018-09-30 15:06:00 68.0 30.0 0.441176 2989.411765 99.647059
106 3540 2018-09-30 13:57:00 2018-09-30 14:15:00 18.0 15.0 0.833333 2950.000000 196.666667
107 4867 2018-09-30 14:04:00 2018-09-30 14:46:00 42.0 26.0 0.619048 3012.904762 115.880952
108 3590 2018-09-30 14:22:00 2018-09-30 15:17:00 55.0 8.0 0.145455 522.181818 65.272727
2018-09-30 14:30:00 105 6776 2018-09-30 13:58:00 2018-09-30 15:06:00 68.0 30.0 0.441176 2989.411765 99.647059
107 4867 2018-09-30 14:04:00 2018-09-30 14:46:00 42.0 16.0 0.380952 1854.095238 115.880952
108 3590 2018-09-30 14:22:00 2018-09-30 15:17:00 55.0 30.0 0.545455 1958.181818 65.272727
109 8918 2018-09-30 14:58:00 2018-09-30 15:24:00 26.0 2.0 0.076923 686.000000 343.000000
2018-09-30 15:00:00 105 6776 2018-09-30 13:58:00 2018-09-30 15:06:00 68.0 6.0 0.088235 597.882353 99.647059
108 3590 2018-09-30 14:22:00 2018-09-30 15:17:00 55.0 17.0 0.309091 1109.636364 65.272727
109 8918 2018-09-30 14:58:00 2018-09-30 15:24:00 26.0 24.0 0.923077 8232.000000 343.000000
哪个看起来差不多
因此,让我们回顾一下重要的一点:
for start, end in zip( baseindex, baseindex[1:] ):
# filter non-overlapping periods
exclude_mask = (df[ 'end_time' ] < start) | (df[ 'start_time' ] > end)
for code, dfg in df[ ~exclude_mask ].groupby( 'code' ):
overlap = min( dfg.end_time.iat[0], end ) - max( dfg.start_time.iat[0], start )
start
和end
取值 (2018-09-30 12:30:00, 2018-09-30 13:00:00(...(2018-09-30 13:00:00, 2018-09-30 13:30:00)...等等,以便我们可以在这些时间之间过滤df
。
exclude_mask
对于start_time
和end_time
与当前(start
,end
(完全没有重叠的每一行,True
。~exclude_mask
翻转True
/False
for code, dfg in df[ ~exclude_mask ].groupby( 'code' )
遍历按code
分组的具有重叠时间的行(即单个行(,并生成code
,这只是代码值,dfg
是表示组的数据帧(在这种情况下,它始终有 1 行(
overlap = min( dfg.end_time.iat[0], end ) - max( dfg.start_time.iat[0], start )
只是当前(start
,end
(期间的重叠。
希望这是有道理的
编辑
我们可以将数据帧整理为所需的输出:
>>> df3 = df2.reset_index()
>>> df3[ 'half_hourly_bucket' ] = df3['start'].apply( lambda x: str(x.time()) + '-' + str((x+timedelta(minutes=30)).time()) )
>>>
>>> df3[ ['code','half_hourly_bucket','quantity'] ].set_index('code').sort_index()
half_hourly_bucket quantity
code
100 12:30:00-13:00:00 301.358491
100 13:00:00-13:30:00 753.396226
100 13:30:00-14:00:00 276.245283
101 12:30:00-13:00:00 1949.612903
101 13:00:00-13:30:00 2088.870968
101 13:30:00-14:00:00 278.516129
102 12:30:00-13:00:00 1644.631579
102 13:00:00-13:30:00 7048.421053
102 13:30:00-14:00:00 234.947368
103 12:30:00-13:00:00 4373.407407
103 13:00:00-13:30:00 760.592593
104 13:00:00-13:30:00 6065.000000
105 15:00:00-15:30:00 597.882353
105 14:30:00-15:00:00 2989.411765
105 13:30:00-14:00:00 199.294118
105 14:00:00-14:30:00 2989.411765
106 14:00:00-14:30:00 2950.000000
106 13:30:00-14:00:00 590.000000
107 14:00:00-14:30:00 3012.904762
107 14:30:00-15:00:00 1854.095238
108 14:00:00-14:30:00 522.181818
108 14:30:00-15:00:00 1958.181818
108 15:00:00-15:30:00 1109.636364
109 14:30:00-15:00:00 686.000000
109 15:00:00-15:30:00 8232.000000
>>>