Computing a running elapsed time between dates within a GroupBy



I have a df that looks like this:

id start_dt    end_dt
1  2021-01-01  2021-01-14
1  2021-01-01  2021-01-15
1  2021-01-13  2021-01-16
2  2021-02-01  2021-01-04
2  2021-02-01  2021-01-07

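I want to create a third column, "diff", holding the running difference in days, grouped by id. The calculation should take into account the overlap between a row's start_dt and end_dt and those of the previous record in the same group.

Expected output: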

id start_dt    end_dt      diff
1  2021-01-01  2021-01-14  14
1  2021-01-01  2021-01-15  1
1  2021-01-13  2021-01-16  1
2  2021-02-01  2021-01-04  4
2  2021-02-01  2021-01-07  3

I wrote this in another language and it works there. I am now porting it to Python and I am stuck on this part. Is there a way to do it?

Thanks

This is less a Python question than a Pandas one. You can use groupby followed by apply here, but you have to make sure the original index is preserved:

import numpy as np

def make_diff(dg):
    """
    The transformation function that builds a Series with the differences in days
    and uses an 'index' column as index.
    """
    prev = dg['end_dt'].shift()    # the end date of the previous row
    # compute the actual start date: the previous end date when it overlaps this row
    start = np.where(prev.isna() | (prev < dg['start_dt']),
                     dg['start_dt'], prev)
    # return the Series, indexed by the original row labels
    return (dg.set_index('index')['end_dt'] - start).rename('diff')

# reset_index() exposes the original index as an 'index' column for make_diff
df['diff'] = df.reset_index().groupby('id').apply(make_diff).reset_index(level=0)['diff']

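For your example dataframe, it gives the following (note that the first row of each group is counted exclusively, so it shows 13 and 3 rather than the 14 and 4 in the question):

   id   start_dt     end_dt    diff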
0   1 2021-01-01 2021-01-14 13 days
1   1 2021-01-01 2021-01-15  1 days
2   1 2021-01-13 2021-01-16  1 days
3   2 2021-02-01 2021-02-04  3 days
4   2 2021-02-01 2021-02-07  3 days

The only requirement is that the rows within each group are sorted correctly.
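
The same idea can also be written without apply, by shifting end_dt inside each group. The following is only a sketch under a few assumptions: the sample data uses February end dates for group 2, the sort order (id, start_dt, end_dt) is a guess at what "correctly sorted" means, and the diff_inclusive column is a hypothetical variant that counts both endpoints, which is what the 14 and 4 in the question seem to imply.

```python
import pandas as pd

# Sample data from the question, with group 2 end dates assumed to be in February
df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2],
    'start_dt': pd.to_datetime(['2021-01-01', '2021-01-01', '2021-01-13',
                                '2021-02-01', '2021-02-01']),
    'end_dt': pd.to_datetime(['2021-01-14', '2021-01-15', '2021-01-16',
                              '2021-02-04', '2021-02-07']),
})

# Assumed ordering, so that "previous row" is meaningful inside each id
df = df.sort_values(['id', 'start_dt', 'end_dt'])

# End date of the previous row in the same id (NaT on the first row of each id)
prev_end = df.groupby('id')['end_dt'].shift()

# Exclusive count: start from the previous end date when it overlaps this row
start = df['start_dt'].where(prev_end.isna() | (prev_end < df['start_dt']), prev_end)
df['diff'] = (df['end_dt'] - start).dt.days

# Inclusive count (hypothetical variant): first day not already covered
next_free = prev_end + pd.Timedelta(days=1)
first_new_day = df['start_dt'].where(next_free.isna() | (next_free < df['start_dt']), next_free)
df['diff_inclusive'] = (df['end_dt'] - first_new_day).dt.days + 1

print(df)
```

Like the apply version, this compares each row only with the row directly before it, so it relies on the same sorting requirement.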

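We first prepare the dataframe for reproducibility (as far as I can tell, your dataframe is inconsistent: the group 2 end dates fall before their start dates, so February end dates are used here)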

import numpy as np
import pandas as pd
from io import StringIO
from datetime import datetime
data = StringIO(
"""
id start_dt    end_dt
1  2021-01-01  2021-01-14
1  2021-01-01  2021-01-15
1  2021-01-13  2021-01-16
2  2021-02-01  2021-02-04
2  2021-02-01  2021-02-07
""")
# sep=r'\s+' splits on runs of whitespace (delim_whitespace is deprecated in recent pandas)
df = pd.read_csv(data, sep=r'\s+')
df['start_dt'] = pd.to_datetime(df['start_dt'])
df['end_dt'] = pd.to_datetime(df['end_dt'])

Next we convert the dates to day counts, as offsets from some arbitrary base_day

base_day = datetime(day = 1,month = 1,year = 2021) # does not really matter
df['start_day'] = (df['start_dt'] - base_day).dt.days
df['end_day'] = (df['end_dt'] - base_day).dt.days

Next we

  • find the rolling maximum of end_day up to the previous row, per group (hence the shift(1))
  • figure out from_day
  • take the difference

df['max_end_day'] = df.groupby('id')['end_day'].rolling(window = 100, min_periods = 1).max().reset_index(drop=True).shift(1)
df['from_day'] = df[['start_day','max_end_day']].max(axis=1)
df['diff'] = df['end_day'] - df['from_day']

We get (with all intermediate calculations shown for clarity)


      id  start_dt             end_dt                 start_day    end_day    max_end_day    from_day    diff
--  ----  -------------------  -------------------  -----------  ---------  -------------  ----------  ------
 0     1  2021-01-01 00:00:00  2021-01-14 00:00:00            0         13            nan           0      13
 1     1  2021-01-01 00:00:00  2021-01-15 00:00:00            0         14             13          13       1
 2     1  2021-01-13 00:00:00  2021-01-16 00:00:00           12         15             14          14       1
 3     2  2021-02-01 00:00:00  2021-02-04 00:00:00           31         34             15          31       3
 4     2  2021-02-01 00:00:00  2021-02-07 00:00:00           31         37             34          34       3
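
One detail in the table above: the first row of id 2 shows max_end_day = 15, a value carried over from id 1, because the shift is applied after the groups have been put back together. It is harmless here since start_day (31) is larger, but it could matter for other data. A possible group-aware variant is sketched below; it swaps the fixed-size rolling window for a per-group cummax and shifts inside each group. The sample frame and the choice of pd.Timestamp for base_day are assumptions made to keep the sketch self-contained.

```python
import pandas as pd

# Sample data as used in this answer (group 2 end dates in February)
df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2],
    'start_dt': pd.to_datetime(['2021-01-01', '2021-01-01', '2021-01-13',
                                '2021-02-01', '2021-02-01']),
    'end_dt': pd.to_datetime(['2021-01-14', '2021-01-15', '2021-01-16',
                              '2021-02-04', '2021-02-07']),
})

base_day = pd.Timestamp('2021-01-01')  # arbitrary reference date
df['start_day'] = (df['start_dt'] - base_day).dt.days
df['end_day'] = (df['end_dt'] - base_day).dt.days

# Running maximum of end_day inside each id (no window size to pick)
running_max = df.groupby('id')['end_day'].cummax()
# Shift inside each id as well, so the first row of every id gets NaN
df['max_end_day'] = running_max.groupby(df['id']).shift()

# max(axis=1) skips NaN, so the first row of each id falls back to start_day
df['from_day'] = df[['start_day', 'max_end_day']].max(axis=1)
df['diff'] = (df['end_day'] - df['from_day']).astype(int)

print(df)
```

With this version max_end_day is NaN on the first row of every id, and the diff column is unchanged for the sample data.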

Try grouping by id, then take the minimum of start_dt and the maximum of end_dt, and create a new column diff holding the difference in days

import pandas as pd

ids = [1, 1, 1, 2, 2]
start_dt = ['2021-01-01', '2021-01-01', '2021-01-13', '2021-02-01', '2021-02-01']
end_dt = ['2021-01-14', '2021-01-15', '2021-01-16', '2021-01-04', '2021-01-07']

df = pd.DataFrame({'id': ids, 'start_dt': start_dt, 'end_dt': end_dt})
# parse the date strings into datetime64 columns
df['start_dt'] = pd.to_datetime(df['start_dt'])
df['end_dt'] = pd.to_datetime(df['end_dt'])
print(df)
# one row per id: earliest start, latest end, and start minus end in days
grouped = df.groupby('id').agg({'start_dt': 'min', 'end_dt': 'max'}).assign(diff=lambda g: (g['start_dt'] - g['end_dt']).dt.days)
print(grouped)

Output:

     start_dt     end_dt  diff
id
1  2021-01-01 2021-01-16   -15
2  2021-02-01 2021-01-07    25
