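Given a DataFrame with timestamps (ts), I would like to downsample it by the hour. Values that were previously indexed by ts should now be split proportionally, based on the number of minutes that fall into each hour. [Note: while resampling, the data should be divided proportionally across the hours that would otherwise be NaN.]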
ts event duration
0 2020-09-09 21:01:00 a 12
1 2020-09-10 00:10:00 a 22
2 2020-09-10 01:31:00 a 130
3 2020-09-10 01:50:00 b 60
4 2020-09-10 01:51:00 b 50
5 2020-09-10 01:59:00 b 26
6 2020-09-10 02:01:00 c 72
7 2020-09-10 02:51:00 b 51
8 2020-09-10 03:01:00 b 63
9 2020-09-10 04:01:00 c 79
import pandas as pd

def create_dataframe():
    df = pd.DataFrame([{'duration':12, 'event':'a', 'ts':'2020-09-09 21:01:00'},
                       {'duration':22, 'event':'a', 'ts':'2020-09-10 00:10:00'},
                       {'duration':130, 'event':'a', 'ts':'2020-09-10 01:31:00'},
                       {'duration':60, 'event':'b', 'ts':'2020-09-10 01:50:00'},
                       {'duration':50, 'event':'b', 'ts':'2020-09-10 01:51:00'},
                       {'duration':26, 'event':'b', 'ts':'2020-09-10 01:59:00'},
                       {'duration':72, 'event':'c', 'ts':'2020-09-10 02:01:00'},
                       {'duration':51, 'event':'b', 'ts':'2020-09-10 02:51:00'},
                       {'duration':63, 'event':'b', 'ts':'2020-09-10 03:01:00'},
                       {'duration':79, 'event':'c', 'ts':'2020-09-10 04:01:00'},
                       {'duration':179, 'event':'c', 'ts':'2020-09-10 06:05:00'},
                       ])
    # Parse the timestamp strings into datetime64 values
    df.ts = pd.to_datetime(df.ts)
    return df
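I want to estimate production per hour from the time spent and the amount produced. It is comparable to asking how many lines of code have been completed, and then finding how many lines were actually written per hour.
For example: at "2020-09-10 00:10:00" we have 22. Over the period 21:01 - 00:10 (189 minutes), the 22 is then split as follows: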
59 min of 21:00 hours -> 7 => =ROUND(22/189*59,0)
60 min of 22:00 hours -> 7 => =ROUND(22/189*60,0)
60 min of 23:00 hours -> 7 => =ROUND(22/189*60,0)
10 min of 00:00 hours -> 1 => =ROUND(22/189*10,0)
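The spreadsheet formulas above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the helper name split_by_minutes is hypothetical, and it assumes the value is spread uniformly over the interval and rounded independently for each hour.

```python
def split_by_minutes(total, minutes):
    """Split `total` across hours in proportion to the minutes spent in each."""
    span = sum(minutes)  # 189 minutes between 21:01 and 00:10
    return [round(total / span * m) for m in minutes]

shares = split_by_minutes(22, [59, 60, 60, 10])
print(shares)  # [7, 7, 7, 1]
```

Because every part is rounded on its own, the parts are not guaranteed to add up to the original value in general, although they do here.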
The result should look like this.
ts event duration
0 2020-09-09 20:00:00 a NaN
1 2020-09-09 21:00:00 a 7
2 2020-09-09 22:00:00 a 7
3 2020-09-09 23:00:00 a 7
4 2020-09-10 00:00:00 a 1
5 2020-09-10 01:00:00 b ..
6 2020-09-10 02:01:00 c ..
Problem with this approach:
As I see it, this approach has a serious problem. If you look at row [1] -> 2020-09-10 07:00:00, we have 4 and need to divide it among 3 hours. Treating the base duration value as 1 (the base unit), the parts cannot all be equal:
def create_dataframe2():
    df = pd.DataFrame([{'duration':4, 'event':'c', 'c':'event3.5', 'ts':'2020-09-10 07:00:00'},
                       {'duration':4, 'event':'c', 'c':'event3.5', 'ts':'2020-09-10 10:00:00'}])
    df.ts = pd.to_datetime(df.ts)
    return df
Source
duration event c ts
0 4 c event3.5 2020-09-10 07:00:00
1 4 c event3.5 2020-09-10 10:00:00
Expected output
ts_hourly mins duration
0 2020-09-10 07:00:00 60.0 2
1 2020-09-10 08:00:00 60.0 1
2 2020-09-10 09:00:00 60.0 1
3 2020-09-10 10:00:00 0.0 0
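One way to obtain whole-unit parts that still add up to the original value is the largest-remainder method. The sketch below is an assumption about how the expected output could be produced, not something stated in the question, and split_largest_remainder is a hypothetical helper name. Each hour first gets the floor of its exact share, and the leftover units go to the hours with the largest fractional parts, with ties going to the earliest hour.

```python
import math

def split_largest_remainder(total, weights):
    """Split an integer `total` into integer parts proportional to `weights`."""
    span = sum(weights)
    exact = [total * w / span for w in weights]
    parts = [math.floor(x) for x in exact]
    leftover = total - sum(parts)
    # Largest fractional part first; sorted() is stable, so ties keep hour order
    order = sorted(range(len(weights)), key=lambda i: parts[i] - exact[i])
    for i in order[:leftover]:
        parts[i] += 1
    return parts

print(split_largest_remainder(4, [60, 60, 60, 0]))  # [2, 1, 1, 0]
```

With the weights taken from the mins column, this reproduces the duration column of the expected output and always preserves the total.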
The first step is to add a "previous ts" column to the source DataFrame:
df['tsPrev'] = df.ts.shift()
Then set the ts column as the index:
df.set_index('ts', inplace=True)
The third step is to create an auxiliary index, composed of the original index and the "full hours":
ind = df.event.resample('H').asfreq().index.union(df.index)
Then create an auxiliary DataFrame, reindexed with the index just created, and back-fill the event column:
df2 = df.reindex(ind)
df2.event = df2.event.bfill()
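To see what the union index and the back-fill do, here is a small self-contained sketch on two made-up rows. The sample rows are hypothetical, and the hourly grid is built with pd.date_range instead of resample(...).asfreq(), which is assumed to give the same full-hour marks for this data.

```python
import pandas as pd

# Two hypothetical source rows, indexed by ts
small = pd.DataFrame(
    {"event": ["a", "b"], "duration": [12, 22]},
    index=pd.to_datetime(["2020-09-09 21:01:00", "2020-09-09 23:10:00"]),
)

# Full-hour marks covering the data, merged with the original timestamps
hours = pd.date_range(small.index.min().floor("h"),
                      small.index.max().floor("h"), freq="h")
ind = hours.union(small.index)

# Rows added by reindexing hold NaN; event is then taken from the next source row
small2 = small.reindex(ind)
small2["event"] = small2["event"].bfill()
print(small2)
```

The duration column keeps NaN on the inserted hour rows, which is what the later grouping step relies on.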
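Define a function to be applied to each group of rows in df2:
def parts(grp):
    lstRow = grp.iloc[-1]        # Last row from group
    if pd.isna(lstRow.tsPrev):   # First group
        return pd.Series([lstRow.duration], index=[grp.index[0]], dtype=int)
    # Other groups: start from 0 at tsPrev, interpolate the running total
    # over the index, then take the differences between consecutive rows.
    # pd.concat replaces Series.append, which was removed in pandas 2.0.
    return -(pd.concat([pd.Series([0], index=[lstRow.tsPrev]), grp.duration])
             .interpolate(method='index').round().diff(-1)[:-1].astype(int))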
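The core of the function is interpolating a running total over the time index and then differencing it. The following sketch isolates that step for the single interval 21:01 - 00:10 with a value of 22; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(["2020-09-09 21:01", "2020-09-09 22:00", "2020-09-09 23:00",
                      "2020-09-10 00:00", "2020-09-10 00:10"])

# Running total: 0 at the previous ts, 22 at the current ts, unknown in between
cum = pd.Series([0, np.nan, np.nan, np.nan, 22], index=idx)
cum = cum.interpolate(method="index").round()

# Difference to the next row, labelled by the start of each segment
per_segment = (-cum.diff(-1))[:-1].astype(int)
print(per_segment.tolist())  # [7, 7, 7, 1]
```

Rounding the running total rather than each part means the parts of one interval always add up to the original value.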
Then generate the source data for the "produced" column in two steps:
Generate the detailed data:
prodDet = df2.groupby(np.isfinite(df2.duration.values[::-1]).cumsum()[::-1], sort=False).apply(parts).reset_index(level=0, drop=True)
The source df2 is grouped in such a way that each group ends with a row that has a non-null value in the duration column. Each group is then processed with the parts function.
The result is:
2020-09-09 21:00:00    12
2020-09-09 21:01:00     7
2020-09-09 22:00:00     7
2020-09-09 23:00:00     7
2020-09-10 00:00:00     1
2020-09-10 00:10:00    80
2020-09-10 01:00:00    50
2020-09-10 01:31:00    60
2020-09-10 01:50:00    50
2020-09-10 01:51:00    26
2020-09-10 01:59:00    36
2020-09-10 02:00:00    36
2020-09-10 02:01:00    51
2020-09-10 02:51:00    57
2020-09-10 03:00:00     6
2020-09-10 03:01:00    78
2020-09-10 04:00:00     1
2020-09-10 04:01:00    85
2020-09-10 05:00:00    87
2020-09-10 06:00:00     7
dtype: int32
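The grouping key used in the groupby call is a reversed cumulative count of non-null durations. A tiny sketch with a made-up duration column shows how it puts each non-null row at the end of its own group:

```python
import numpy as np

# Hypothetical duration column after reindexing: NaN marks inserted hour rows
duration = np.array([np.nan, 12, np.nan, np.nan, 22])

# Count non-null values from the end, then flip back to the original order
key = np.isfinite(duration[::-1]).cumsum()[::-1]
print(key.tolist())  # [2, 2, 1, 1, 1]
```

Rows sharing a key form one group, so the first two rows end with 12 and the last three end with 22. The key values decrease from top to bottom, so sort=False keeps the groups in their original order.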
Generate the aggregated data, for the time being also as a Series:
prod = prodDet.resample('H').sum().rename('produced')
This time prodDet is resampled (broken down by hour), and the result is the sum of the values in each hour.
The result is:
2020-09-09 21:00:00     19
2020-09-09 22:00:00      7
2020-09-09 23:00:00      7
2020-09-10 00:00:00     81
2020-09-10 01:00:00    222
2020-09-10 02:00:00    144
2020-09-10 03:00:00     84
2020-09-10 04:00:00     86
2020-09-10 05:00:00     87
2020-09-10 06:00:00      7
Freq: H, Name: produced, dtype: int32
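The hourly aggregation can be checked in isolation on the first six rows of the detailed data shown earlier. This sketch uses the lowercase "h" alias, which recent pandas versions prefer over "H".

```python
import pandas as pd

# First six rows of the detailed data
detail = pd.Series(
    [12, 7, 7, 7, 1, 80],
    index=pd.to_datetime(["2020-09-09 21:00", "2020-09-09 21:01", "2020-09-09 22:00",
                          "2020-09-09 23:00", "2020-09-10 00:00", "2020-09-10 00:10"]),
)

# Every timestamp is assigned to the hour it falls in, and values are summed
hourly = detail.resample("h").sum()
print(hourly.tolist())  # [19, 7, 7, 81]
```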
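Let's describe the content of prodDet:
There is no row for 2020-09-09 20:00:00, because no source row falls within this hour (your data starts at 21:01:00).
The row
21:00:00 12
comes from the first source row (you forgot to include it in the expected result).
The rows for 21:01:00, 22:00:00, 23:00:00 and 00:00:00 come from the "partitioning" of the row
00:10:00 a 22
just as in that part of your expected result.
The rows with 80 and 50 come from the row containing 130, divided between the rows for 00:10:00 and 01:00:00.
And so on.
Now we start assembling the final result.
Join prod (converted to a DataFrame) with the event column: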
result = prod.to_frame().join(df2.event)
Add the tsMin column - the minimal ts in each hour (as you asked in one of the comments):
result['tsMin'] = df.duration.resample('H').apply(lambda grp: grp.index.min())
Change the index into a regular column and set its name to ts (as in the source DataFrame):
result = result.reset_index().rename(columns={'index': 'ts'})
The final result is:
ts produced event tsMin
0 2020-09-09 21:00:00 19 a 2020-09-09 21:01:00
1 2020-09-09 22:00:00 7 a NaT
2 2020-09-09 23:00:00 7 a NaT
3 2020-09-10 00:00:00 81 a 2020-09-10 00:10:00
4 2020-09-10 01:00:00 222 a 2020-09-10 01:31:00
5 2020-09-10 02:00:00 144 c 2020-09-10 02:01:00
6 2020-09-10 03:00:00 84 b 2020-09-10 03:01:00
7 2020-09-10 04:00:00 86 c 2020-09-10 04:01:00
8 2020-09-10 05:00:00 87 c NaT
9 2020-09-10 06:00:00 7 c 2020-09-10 06:05:00
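For example, the value 81 for 00:00:00 is the sum of 1 and 80 (the first part generated from the row with 130); see prodDet above.
Some values in the tsMin column are empty (NaT), namely for the hours in which there is no source row.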
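The per-hour minimum timestamp, including the NaT for empty hours, can be illustrated on three made-up timestamps. This sketch uses resample(...).min() on a Series of timestamps, which is an alternative to the apply-with-lambda form and is not taken from the answer.

```python
import pandas as pd

# Hypothetical timestamps: two in the 21:00 hour, none in 22:00, one in 23:00
ts = pd.to_datetime(["2020-09-09 21:01", "2020-09-09 21:30", "2020-09-09 23:10"])

# A Series whose values are its own index, so min() returns the earliest ts per hour
first_ts = pd.Series(ts, index=ts).resample("h").min()
print(first_ts)
```

The 22:00 bin contains no rows, so its minimum is NaT.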
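If you want to drop the result from the first row (the one with duration == 12) entirely, change
return pd.Series([lstRow.duration]...
to
return pd.Series([0]...
(the 4th line of the parts function).
To sum up, my solution is more pandasonic and significantly shorter than yours (17 lines (my solution) vs. about 70 (yours), excluding comments).
I could not find a solution in pandas, so I created one in plain Python.
Basically, after sorting, I iterate over all the values and pass two datetimes, start_time and end_time, to a function that does the processing.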
from datetime import timedelta

import pandas as pd

def get_ratio_per_hour(start_time: pd.Timestamp, end_time: pd.Timestamp, data_: int):
    # get total hours between the start and end, use this for looping
    totalhrs = lambda x: [1 for _ in range(int(x // 3600))
                          ] + [
                              (x % 3600 / 3600
                               or 0.1  # added for loop fix afterwards
                               )]
    # check if Start and End are not in same hour
    if start_time.hour != end_time.hour:
        seconds = (end_time - start_time).total_seconds()
        if seconds < 3600:
            parts_ = [1] + totalhrs(seconds)
        else:
            parts_ = totalhrs(seconds)
    else:
        # parts_ define the loop iterations
        parts_ = totalhrs((end_time - start_time).total_seconds())
    sum_of_hrs = sum(parts_)
    # for Constructing DF
    new_hours = []
    mins = []
    # Clone data
    start_time_ = start_time
    end_time_ = end_time
    for e in range(len(parts_)):
        # print(parts_[e])
        if sum_of_hrs != 0:
            if sum_of_hrs > 1:
                if end_time_.hour != start_time_.hour:
                    # Floor > based on the startTime +1 hour
                    floor_time = (start_time_ + timedelta(hours=1)).floor('H')
                    #
                    new_hours.append(start_time_.floor('H'))
                    mins.append((floor_time - start_time_).total_seconds() // 60)
                    sum_of_hrs = sum_of_hrs - 1
                    start_time_ = floor_time
                else:
                    # Hour is same.
                    floor_time = (start_time_ + timedelta(hours=1)).floor('H')
                    new_hours.append(start_time_.floor('H'))
                    mins.append((floor_time - start_time_).total_seconds() // 60)
                    sum_of_hrs = sum_of_hrs - 1
                    start_time_ = floor_time
            else:
                if end_time_.hour != start_time_.hour:
                    # Get round off hour
                    floor_time = (end_time_ + timedelta(hours=1)).floor('H')
                    new_hours.append(end_time_.floor('H'))
                    mins.append(60 - ((floor_time - end_time_).total_seconds() // 60)
                                )
                    sum_of_hrs = sum_of_hrs - 1
                    start_time_ = floor_time
                else:
                    # Hour is same.
                    floor_time = (end_time_ + timedelta(hours=1)).floor('H')
                    new_hours.append(end_time_.floor('H'))
                    mins.append((end_time_ - start_time_).total_seconds() // 60)
                    sum_of_hrs = sum_of_hrs - 1
                    start_time_ = floor_time
    # Get DataFrame Build
    df_out = pd.DataFrame()
    df_out['hours'] = pd.Series(new_hours)
    df_out['mins'] = pd.Series(mins)
    # Share of data_ per hour, proportional to the minutes spent in that hour
    df_out['ratios'] = round(data_ / sum(mins) * df_out['mins'])
    return df_out
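For comparison, the minute-splitting part of the function can be written more compactly with the standard library alone. This is only a sketch of the same idea, not a drop-in replacement: minutes_per_hour is a hypothetical helper, and unlike the function above it never emits a trailing hour with 0 minutes.

```python
from datetime import datetime, timedelta

def minutes_per_hour(start, end):
    """Return (hour_start, minutes) pairs covering the interval [start, end)."""
    parts = []
    cur = start
    while cur < end:
        hour_start = cur.replace(minute=0, second=0, microsecond=0)
        # Stop at the next full hour or at the end of the interval, whichever is first
        nxt = min(hour_start + timedelta(hours=1), end)
        parts.append((hour_start, (nxt - cur).total_seconds() / 60))
        cur = nxt
    return parts

pieces = minutes_per_hour(datetime(2020, 9, 9, 21, 1), datetime(2020, 9, 10, 0, 10))
print([m for _, m in pieces])  # [59.0, 60.0, 60.0, 10.0]
```

The minutes can then be turned into shares with data_ / sum(mins) * mins, as in the last step of the function above.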
Now, let's run the code for each iteration:
import numpy as np

time_val = []
split_f_val = []
split_field = 'duration'
time_field = 'ts'
# creating DataFrames for intermediate results!
df_final = pd.DataFrame()
df2 = pd.DataFrame()
for ix, row in df.iterrows():
    time_val.append(row[str(time_field)])
    split_f_val.append(int(row[str(split_field)]))
    # Skipping First Element for Processing. Therefore, having minimum two data values
    if ix != 0:
        # getting Last Two Values
        new_time_list = time_val[-2:]
        new_data_list = split_f_val[-2:]
        # get times to compare
        start_time = new_time_list[:-1][0]
        end_time = new_time_list[1:][0]
        # get latest Data to divide
        data_ = new_data_list[1:][0]
        # print(start_time)
        # print(end_time)
        df2 = get_ratio_per_hour(start_time, end_time, data_)
        df_final = pd.concat([df_final, df2], ignore_index=True)
    else:
        # Create Empty DataFrame for First Value.
        df_final = pd.DataFrame([[np.nan, np.nan, np.nan]],
                                columns=['hours', 'mins', 'ratios'])
        df_final = pd.concat([df_final, df2], ignore_index=True)
# Sum the per-interval shares that fall into the same hour
result = df_final.groupby(['hours'])['ratios'].sum()
Intermediate DataFrame:
hours mins ratios
0
0 2020-09-09 21:00:00 59.0 7.0
1 2020-09-09 22:00:00 60.0 7.0
2 2020-09-09 23:00:00 60.0 7.0
3 2020-09-10 00:00:00 10.0 1.0
0 2020-09-10 00:00:00 50.0 80.0
1 2020-09-10 01:00:00 31.0 50.0
0 2020-09-10 01:00:00 19.0 60.0
0 2020-09-10 01:00:00 1.0 50.0
0 2020-09-10 01:00:00 8.0 26.0
0 2020-09-10 01:00:00 1.0 36.0
1 2020-09-10 02:00:00 1.0 36.0
0 2020-09-10 02:00:00 50.0 51.0
0 2020-09-10 02:00:00 9.0 57.0
1 2020-09-10 03:00:00 1.0 6.0
0 2020-09-10 03:00:00 59.0 78.0
1 2020-09-10 04:00:00 1.0 1.0
0 2020-09-10 04:00:00 59.0 85.0
1 2020-09-10 05:00:00 60.0 87.0
2 2020-09-10 06:00:00 5.0 7.0
Final output:
hours ratios
2020-09-09 21:00:00 7.0
2020-09-09 22:00:00 7.0
2020-09-09 23:00:00 7.0
2020-09-10 00:00:00 81.0
2020-09-10 01:00:00 222.0
2020-09-10 02:00:00 144.0
2020-09-10 03:00:00 84.0
2020-09-10 04:00:00 86.0
2020-09-10 05:00:00 87.0
2020-09-10 06:00:00 7.0
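A quick sanity check on the output above: the hourly values should add up to the sum of all source durations except the first one, since the first row has no preceding timestamp to spread over. The lists below are copied from create_dataframe and from the final output.

```python
# Durations from create_dataframe(), in row order
durations = [12, 22, 130, 60, 50, 26, 72, 51, 63, 79, 179]

# Hourly ratios from the final output, 21:00 through 06:00
hourly = [7, 7, 7, 81, 222, 144, 84, 86, 87, 7]

total_hourly = sum(hourly)
total_source = sum(durations[1:])  # the first row is skipped by the loop
print(total_hourly, total_source)  # 732 732
```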