Pandas: calculate total hours from overlapping time ranges



I have the following DataFrame:

import pandas as pd
from datetime import datetime
df_dict = {
    'id': [1, 1, 1, 1, 2, 2, 2, 2],
    'start_time': [
        datetime.strptime('Jun 1 2020  1:30PM', '%b %d %Y %I:%M%p'),
        datetime.strptime('Jun 1 2020  2:30PM', '%b %d %Y %I:%M%p'),
        datetime.strptime('Jun 1 2020  3:30PM', '%b %d %Y %I:%M%p'),
        datetime.strptime('Jun 1 2020  4:30PM', '%b %d %Y %I:%M%p'),
        datetime.strptime('Jun 1 2020  1:30PM', '%b %d %Y %I:%M%p'),
        datetime.strptime('Jun 1 2020  2:30PM', '%b %d %Y %I:%M%p'),
        datetime.strptime('Jun 1 2020  3:30PM', '%b %d %Y %I:%M%p'),
        datetime.strptime('Jun 1 2020  4:30PM', '%b %d %Y %I:%M%p'),
    ],
    'end_time': [
        datetime.strptime('Jun 1 2020  2:45PM', '%b %d %Y %I:%M%p'),
        datetime.strptime('Jun 1 2020  3:00PM', '%b %d %Y %I:%M%p'),
        datetime.strptime('Jun 1 2020  4:50PM', '%b %d %Y %I:%M%p'),
        datetime.strptime('Jun 1 2020  4:30PM', '%b %d %Y %I:%M%p'),
        datetime.strptime('Jun 1 2020  3:45PM', '%b %d %Y %I:%M%p'),
        datetime.strptime('Jun 1 2020  5:00PM', '%b %d %Y %I:%M%p'),
        datetime.strptime('Jun 1 2020  5:50PM', '%b %d %Y %I:%M%p'),
        datetime.strptime('Jun 1 2020  6:30PM', '%b %d %Y %I:%M%p'),
    ]
}
df = pd.DataFrame.from_dict(df_dict)
#    id          start_time            end_time
# 0   1 2020-06-01 13:30:00 2020-06-01 14:45:00
# 1   1 2020-06-01 14:30:00 2020-06-01 15:00:00
# 2   1 2020-06-01 15:30:00 2020-06-01 16:50:00
# 3   1 2020-06-01 16:30:00 2020-06-01 16:30:00
# 4   2 2020-06-01 13:30:00 2020-06-01 15:45:00
# 5   2 2020-06-01 14:30:00 2020-06-01 17:00:00
# 6   2 2020-06-01 15:30:00 2020-06-01 17:50:00
# 7   2 2020-06-01 16:30:00 2020-06-01 18:30:00

I want to calculate the total number of hours for each id without double counting overlapping intervals.
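To make the expected result concrete, here is the arithmetic for id 1 worked by hand. The four rows collapse into two disjoint intervals, 13:30 to 15:00 and 15:30 to 16:50 (the zero-length row at 16:30 adds nothing). The names `merged_id1` and `total_hours_id1` are only for illustration.

```python
from datetime import datetime

# id 1 after merging the overlaps by hand: 13:30-15:00 and 15:30-16:50
merged_id1 = [
    (datetime(2020, 6, 1, 13, 30), datetime(2020, 6, 1, 15, 0)),
    (datetime(2020, 6, 1, 15, 30), datetime(2020, 6, 1, 16, 50)),
]

# 1 h 30 min + 1 h 20 min = 2 h 50 min
total_hours_id1 = sum((end - start).total_seconds() for start, end in merged_id1) / 3600
print(round(total_hours_id1, 6))  # 2.833333
```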

I have the code below, which gives the correct result:

import sqlite3
conn = sqlite3.connect(':memory:')
df.to_sql('df', conn, index=False)
query = '''
SELECT id, SUM(CAST((JulianDay(end_time)-JulianDay(start_time))*24 AS real)) AS total_hours
FROM (
    SELECT s1.id,
           s1.start_time,
           MIN(t1.end_time) AS end_time
    FROM df s1
    INNER JOIN df t1 ON s1.start_time <= t1.end_time
        AND s1.id = t1.id
        AND NOT EXISTS(SELECT * FROM df t2
                       WHERE t1.end_time >= t2.start_time AND t1.end_time < t2.end_time AND t2.id = t1.id)
    WHERE NOT EXISTS(SELECT * FROM df s2
                     WHERE s1.start_time > s2.start_time AND s1.start_time <= s2.end_time AND s2.id = t1.id)
    GROUP BY s1.start_time, s1.id
    ORDER BY s1.id, s1.start_time
) x
GROUP BY id
'''
df = pd.read_sql_query(query, conn)
print(df)
#    id  total_hours
# 0   1     2.833333
# 1   2     5.000000

But I would like to know whether there is a better or more elegant way to do this without SQL.

Basically, what the SQL code does is merge the overlapping intervals. You should do the same here. My suggestion is the following:

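def remove_overlap_intervals(intervals):
    # Sort by start time; the local name must not shadow the built-in sorted()
    sorted_intervals = sorted(intervals, key=lambda tup: tup[0])
    merged = []
    for a in sorted_intervals:
        if not merged:
            merged.append(a)
        else:
            b = merged[-1]
            if a[0] <= b[1]:
                # a overlaps the last merged interval: extend its upper bound
                upper_bound = max(b[1], a[1])
                merged[-1] = (b[0], upper_bound)
            else:
                merged.append(a)
    return merged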

Then do what you would do if the intervals did not overlap:


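# Pack each row into a (start, end) tuple and collect the tuples per id
df['time'] = df[['start_time', 'end_time']].apply(tuple, axis=1)
Grouped = df.groupby(['id'])['time'].apply(list)
Grouped_no_overlap = Grouped.apply(remove_overlap_intervals)
# total_seconds() also counts whole days, which .seconds would silently drop
Grouped = Grouped_no_overlap.apply(lambda x: sum((y[1] - y[0]).total_seconds() for y in x)) / 3600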

This gives:

id
1    2.833333
2    5.000000
Name: time, dtype: float64
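As a quick sanity check of the merging idea on its own, the sketch below restates it compactly and runs it on plain integers instead of timestamps. The name `merge_intervals` and the sample values are made up for illustration; the logic is meant to mirror the function above, not replace it.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) pairs; a compact restatement for illustration."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the last merged interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Plain integers stand in for timestamps; (2, 3) is fully contained in (1, 4).
example = [(1, 4), (2, 3), (6, 8), (8, 9), (11, 12)]
print(merge_intervals(example))  # [(1, 4), (6, 9), (11, 12)]
```

Note that intervals that merely touch, such as (6, 8) and (8, 9), are merged as well, which does not change the total duration.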

You can use pandas' groupby function. The following code does the job:

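import numpy as np
# Previous row's end time within the same id (assumes rows are sorted by start_time)
prev_end = df.groupby('id')['end_time'].shift(1)
# If a row starts inside the previous interval, count it only from the previous end onwards
df['start_time_tmp'] = np.where((df['start_time'] <= prev_end) &
                                (df['end_time'] >= prev_end), prev_end, df['start_time'])
df['diff'] = df['end_time'] - df['start_time_tmp']
df.groupby(by='id')['diff'].sum().dt.total_seconds() / 60 / 60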

The output is:

id
1    2.833333
2    5.000000
Name: diff, dtype: float64

I hope the code is self-explanatory. If you need some help with the groupby functionality, you can check the documentation.
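For readers new to groupby, here is a minimal sketch of the final aggregation step in isolation: summing timedeltas per id and converting the sum to hours. The frame `toy` and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical per-row durations, already free of overlaps
toy = pd.DataFrame({
    "id": [1, 1, 2],
    "diff": pd.to_timedelta(["1h", "30min", "2h"]),
})

# Sum the timedeltas per id, then convert seconds to hours
hours = toy.groupby("id")["diff"].sum().dt.total_seconds() / 3600
print(hours)
```

Here id 1 adds up to 1.5 hours and id 2 to 2.0 hours.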

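As far as I can see, the time resolution of your datetimes is limited to minutes.

So one possible solution is:

  • For each row in a group (by id), generate a date_range object with minute frequency, excluding the right edge of each range.
  • Concatenate these ranges.
  • Count the unique values (this gives the number of minutes).
  • Divide by 60 to get the number of hours.

The code to do this is: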

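  1. Define a getRng function that generates a date_range object from the current row:

    def getRng(row):
        # inclusive='left' needs pandas >= 1.4; older versions use closed='left'
        return pd.date_range(row.start_time, row.end_time, freq='min', inclusive='left')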
    
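  2. Define a getHrs function that computes the number of hours for the current group:

    import numpy as np
    def getHrs(grp):
        return np.unique(np.hstack(grp.apply(getRng, axis=1))).size / 60
    

    I deliberately chose NumPy functions, since they are known to run faster than their pandas counterparts.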

  3. Apply the above function to each group (by id) and convert the result to a DataFrame:

    result = df.groupby('id').apply(getHrs).rename('total_hours').reset_index()
    

For your data sample, the result is:

   id  total_hours
0   1     2.833333
1   2     5.000000

I think this solution is shorter (only 5 lines of code), more readable than your SQL, and more pandasonic.

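The counting idea can be checked by hand on the first two rows of id 1. The sketch below uses NumPy's `np.arange` on `datetime64` values as a stand-in for `pd.date_range`, since `np.arange` is half-open by construction and so excludes the right edge without any extra parameter. The variable names are illustrative only.

```python
import numpy as np

one_minute = np.timedelta64(1, "m")

# Rows 0 and 1 of id 1: 13:30-14:45 and 14:30-15:00, right edge excluded
a = np.arange(np.datetime64("2020-06-01T13:30"), np.datetime64("2020-06-01T14:45"), one_minute)
b = np.arange(np.datetime64("2020-06-01T14:30"), np.datetime64("2020-06-01T15:00"), one_minute)

# The 15 overlapping minutes (14:30-14:45) are counted only once
unique_minutes = np.unique(np.concatenate([a, b]))
print(unique_minutes.size / 60)  # 1.5
```

Memory grows with the number of minutes covered, so this approach is probably best kept for short intervals at coarse resolution.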
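Late to the party, but since I needed the same functionality, let me share my answer. It is similar to coco18's answer, but it also handles the case where intervals overlap and are fully contained in other intervals.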

import datetime as dt

df = df.sort_values(["id", "start_time", "end_time"])
grouper = df.groupby(by="id")
# The max seen end date for the row and all predecessors.
df["max_seen_end"] = grouper["end_time"].cummax()
# Now the max end date of all former rows
df["max_seen_end"] = grouper["max_seen_end"].shift(1).fillna(pd.Timestamp.min)
# Ensure start/end dates do not overlap with former intervals by moving them
# to the right if needed.
corrected_start = df[["start_time", "max_seen_end"]].max(axis=1)
corrected_end = df[["end_time", "max_seen_end"]].max(axis=1)
df["hours"] = corrected_end - corrected_start
grouper["hours"].sum() / dt.timedelta(hours=1)

Result:

id
1    2.833333
2    5.000000
Name: hours, dtype: float64
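To illustrate the contained-interval case mentioned above, here is a small self-contained sketch of the same clipping idea on made-up data, where 02:00 to 03:00 lies entirely inside 01:00 to 05:00. The frame `toy` and the helper names are assumptions for the example. As a variation, missing predecessors are filled with the row's own start time instead of `pd.Timestamp.min`, which makes the clipping a no-op for the first row of each id.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 1, 1],
    "start_time": pd.to_datetime(["2020-06-01 01:00", "2020-06-01 02:00", "2020-06-01 04:00"]),
    "end_time": pd.to_datetime(["2020-06-01 05:00", "2020-06-01 03:00", "2020-06-01 06:00"]),
}).sort_values(["id", "start_time", "end_time"])

# Latest end time among all earlier rows of the same id
prev_max_end = (
    toy.groupby("id")["end_time"].cummax()
    .groupby(toy["id"]).shift(1)
    .fillna(toy["start_time"])  # first row of each id: nothing to clip against
)

# Move start and end to the right so they never reach back into covered time
start = toy["start_time"].where(toy["start_time"] >= prev_max_end, prev_max_end)
end = toy["end_time"].where(toy["end_time"] >= prev_max_end, prev_max_end)

hours = (end - start).groupby(toy["id"]).sum() / pd.Timedelta(hours=1)
naive = (toy["end_time"] - toy["start_time"]).sum() / pd.Timedelta(hours=1)
print(hours.loc[1], naive)  # 5.0 7.0
```

The union of the three intervals is 01:00 to 06:00, so 5 hours is the expected total, while simply summing the raw durations gives 7.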
