I'm working on a tough problem and I don't know how to solve it.
I have a DataFrame like this:
Product_Name Start_Time End_Time
Product X 2021-10-20 20:32:00 2021-10-21 03:50:00
Product Y 2021-10-21 11:50:00 2021-10-21 16:00:00
Product Z 2022-01-11 20:10:00 2022-01-12 15:30:00
I have three time ranges, each with its own category:
A: 05:01 to 14:00
B: 14:01 to 22:00
C: 22:01 to 05:00
What I want to do is calculate, based on "Start_Time" and "End_Time", how many decimal hours fall into each category (A, B and C), to get something like this:
Product_Name Start_Time End_Time A B C
Product X 2021-10-20 20:30:00 2021-10-21 03:50:00 0.00 1.50 5.82
Product Y 2021-10-21 11:50:00 2021-10-21 16:00:00 2.17 1.98 0.00
Product Z 2022-01-11 20:10:00 2022-01-12 15:30:00 8.98 3.31 6.98
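Can you help me with how to do this?
I'm a beginner with Python, pandas, etc., and when I first wrote this post I really had no idea how to start coding it. So I started thinking it through and came up with the code below. I'm sure it isn't right, but I think it's a start: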
start_a = 05:01:00
end_a = 14:00:00
start_b = 14:01:00
end_b = 22:00:00
start_c = 22:01:00
end_c = 05:00:00
if df['Start_Time'] > start_a and df['End_Time'] < end_a:
    df['A'] = ( df['End_Time'] - start_a ) - ( end_a - df['Start_Time'] )
else:
    df['A'] = 0
if df['Start_Time'] > start_b and df['End_Time'] < end_b:
    df['B'] = ( df['End_Time'] - start_b ) - ( end_b - df['Start_Time'] )
else:
    df['B'] = 0
if df['Start_Time'] > start_c and df['End_Time'] < end_c:
    df['C'] = ( df['End_Time'] - start_c ) - ( end_c - df['Start_Time'] )
else:
    df['C'] = 0
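Your problem is harder than I first thought. One thing to note is that Start_Time and End_Time can fall on different dates. In addition, category C spans two days. These two facts make the code a bit more complex, but it seems to work.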
First, the setup for your problem. I created your DataFrame and the range variables. It is important that these objects have the right data types.
import pandas as pd
from io import StringIO
from datetime import datetime, time, date, timedelta
# Create your data frame
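# Columns are separated by two or more spaces, because the product names
# and the timestamps themselves contain single spaces
data = StringIO("""Product_Name  Start_Time  End_Time
Product X  2021-10-20 20:32:00  2021-10-21 03:50:00
Product Y  2021-10-21 11:50:00  2021-10-21 16:00:00
Product Z  2022-01-11 20:10:00  2022-01-12 15:30:00""")
df = pd.read_csv(data, sep=r'\s{2,}', engine='python')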
# Convert the columns to date time format
df[["Start_Time", "End_Time"]] = df[["Start_Time", "End_Time"]].apply(pd.to_datetime)
# Create the range start and end time as datetime format
start_a = datetime.strptime('05:01:00', '%H:%M:%S')
end_a = datetime.strptime('14:00:00', '%H:%M:%S')
start_b = datetime.strptime('14:01:00', '%H:%M:%S')
end_b = datetime.strptime('22:00:00', '%H:%M:%S')
start_c = datetime.strptime('22:01:00', '%H:%M:%S')
end_c = datetime.strptime('05:00:00', '%H:%M:%S')
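As a small illustration of why the data types matter here: `datetime.strptime` with a time-only format still returns a full `datetime` (with the default date 1900-01-01), and only its `.time()` part is meaningful for a range boundary. The variable name `boundary` below is just for this sketch.

```python
from datetime import datetime

# Parsing only a time still yields a full datetime with a default date
boundary = datetime.strptime('05:01:00', '%H:%M:%S')
print(boundary)         # 1900-01-01 05:01:00
print(boundary.time())  # 05:01:00
```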
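Next, I wrote a function that calculates the hours for your problem. start and end are the times that define one range. The function iterates over the days and checks how much of the range falls inside the interval. Often a single day is enough, but your Product Z spans two days, so the range has to be checked on more than one day.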
def calc_hours(start_time, end_time, start, end):
    # Anchor the range on the date of start_time => allows us to compare it to start_time and end_time
    range_start = datetime.combine(start_time.date(), start.time())
    range_end = datetime.combine(start_time.date(), end.time())
    # Special case for range C, which ends on the next day: anchor its start one day earlier,
    # so a start_time after midnight (e.g. 03:00) still hits the range that began the previous evening
    if range_end < range_start:
        range_start = range_start - timedelta(days=1)
    # start_time and end_time can span one or more days => iterate over the days and sum the hours in the range
    total_hours = 0.0
    while range_start < end_time:
        # Overlap in hours, or zero if the range does not intersect the time frame
        hours_in_frame = max((min(range_end, end_time) - max(range_start, start_time)).total_seconds(), 0) / 3600
        total_hours += hours_in_frame
        # Move the range to the next day and check again
        range_start = range_start + timedelta(days=1)
        range_end = range_end + timedelta(days=1)
    return total_hours
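The heart of the loop is a clamped interval overlap: take the later of the two starts and the earlier of the two ends, and treat a negative difference as zero. A minimal standalone sketch of just that step, where `overlap_hours` is a hypothetical helper name and not part of the code above:

```python
from datetime import datetime

def overlap_hours(a_start, a_end, b_start, b_end):
    """Hours shared by the intervals [a_start, a_end] and [b_start, b_end]."""
    seconds = (min(a_end, b_end) - max(a_start, b_start)).total_seconds()
    # A negative difference means the intervals do not intersect
    return max(seconds, 0) / 3600

# Product X against category C anchored on 2021-10-20 (22:01 to 05:00 the next day)
hours_c = overlap_hours(
    datetime(2021, 10, 20, 20, 32), datetime(2021, 10, 21, 3, 50),
    datetime(2021, 10, 20, 22, 1), datetime(2021, 10, 21, 5, 0),
)
print(round(hours_c, 2))  # 5.82

# Product X against category A on 2021-10-20 (05:01 to 14:00): no overlap
hours_a = overlap_hours(
    datetime(2021, 10, 20, 20, 32), datetime(2021, 10, 21, 3, 50),
    datetime(2021, 10, 20, 5, 1), datetime(2021, 10, 20, 14, 0),
)
print(hours_a)  # 0.0
```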
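To use the function and add the results to the DataFrame, I used the apply() function from pandas. apply() takes each row of the DataFrame and calculates the hours in a given range using the function shown above. This is done for all three ranges.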
# Use apply to calculate the hours for each row and each range
df['A'] = df.apply(lambda x: calc_hours(x['Start_Time'], x['End_Time'], start_a, end_a), axis=1)
df['B'] = df.apply(lambda x: calc_hours(x['Start_Time'], x['End_Time'], start_b, end_b), axis=1)
df['C'] = df.apply(lambda x: calc_hours(x['Start_Time'], x['End_Time'], start_c, end_c), axis=1)
The output is almost what you want, except that it is not rounded to two decimal places:
Product_Name Start_Time End_Time A B C
0 Product X 2021-10-20 20:32:00 2021-10-21 03:50:00 0.000000 1.466667 5.816667
1 Product Y 2021-10-21 11:50:00 2021-10-21 16:00:00 2.166667 1.983333 0.000000
2 Product Z 2022-01-11 20:10:00 2022-01-12 15:30:00 8.983333 3.316667 6.983333
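If two decimal places are needed, `DataFrame.round` can be applied to the three hour columns. A minimal sketch, assuming a small stand-in frame named `hours` that holds the values printed above (in the real code you would round the A, B and C columns of `df` instead):

```python
import pandas as pd

# Stand-in for the A, B and C columns computed above (values copied from the output)
hours = pd.DataFrame({
    'A': [0.000000, 2.166667, 8.983333],
    'B': [1.466667, 1.983333, 3.316667],
    'C': [5.816667, 0.000000, 6.983333],
})

# Round every column to two decimal places
rounded = hours.round(2)
print(rounded)
```

On the real frame this would read `df[['A', 'B', 'C']] = df[['A', 'B', 'C']].round(2)`.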
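Another approach is to create a series containing the minute-of-day number of every minute in the range in question, and then intersect it with the minutes of each interval to get the overlapping duration.
I don't have time to provide a complete answer, but I thought I'd drop the idea here so you can build on it.
Create the reference series: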
import pandas as pd
start = pd.Timestamp('22:01')
end = pd.Timestamp('05:00')
if end < start:
    end += pd.Timedelta(days=1)
# Minute-of-day number (0 to 1439) for every minute in category C
rng = pd.Series(pd.date_range(start=start, end=end, freq='min'))
drC = rng.dt.hour * 60 + rng.dt.minute
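To see what such a reference series holds, the sketch below builds the same minute-of-day numbers for category C and inspects them. It assumes a fixed dummy date (2000-01-01) purely to keep the example reproducible; the names `minutes` and `minute_of_day` are illustrative.

```python
import pandas as pd

# Assumption: a fixed dummy date, so the result does not depend on the day the code runs
start = pd.Timestamp('2000-01-01 22:01')
end = pd.Timestamp('2000-01-02 05:00')

minutes = pd.Series(pd.date_range(start=start, end=end, freq='min'))
minute_of_day = minutes.dt.hour * 60 + minutes.dt.minute

print(len(minute_of_day))       # 420 minute marks, both endpoints included
print(minute_of_day.iloc[0])    # 1321, i.e. 22:01
print(minute_of_day.iloc[-1])   # 300, i.e. 05:00

# 420 marks enclose 419 one-minute steps, which is the full length of category C
full_hours = (len(minute_of_day) - 1) / 60
print(round(full_hours, 2))     # 6.98
```

Because both endpoints are counted, n shared minute marks correspond to n - 1 minutes of overlap, which is why a duration derived from such a count has to subtract one.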
Create a function that performs the intersection and computes the duration:
import numpy as np
def intersecting_duration(x):
    # Minute-of-day number for every minute between Start_Time and End_Time
    rng = pd.Series(pd.date_range(start=x['Start_Time'], end=x['End_Time'], freq='min'))
    min_of_day = rng.dt.hour * 60 + rng.dt.minute
    # Count the minute marks shared with category C; n marks span n - 1 minutes.
    # Note: this is exact only when the overlap is a single contiguous run
    dur_mins = len(np.intersect1d(min_of_day, drC))
    return 0 if (dur_mins == 0) else (dur_mins-1)/60
Then apply it:
df.apply(intersecting_duration, axis=1)
0 5.816667
1 0.000000
2 6.983333
You'll need to take it from there.