I have a df like this:
id start_dt end_dt
1 2021-01-01 2021-01-14
1 2021-01-01 2021-01-15
1 2021-01-13 2021-01-16
2 2021-02-01 2021-01-04
2 2021-02-01 2021-01-07
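I want to create a third column, "diff", that holds the running difference in days, grouped by id. The calculation should take into account any overlap with the start_dt and end_dt of the previous records in the same group.
Expected output: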
id start_dt end_dt diff
1 2021-01-01 2021-01-14 14
1 2021-01-01 2021-01-15 1
1 2021-01-13 2021-01-16 1
2 2021-02-01 2021-01-04 4
2 2021-02-01 2021-01-07 3
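I wrote this in another language and it works; now I am porting it to Python and I am stuck on this part. Is there a way to do it? Thanks
This is not really a Python question but a Pandas one. You can use groupby and apply here, but you must make sure to preserve the original index: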
import numpy as np

def make_diff(dg):
    """
    The transformation function: builds a Series with the differences in days,
    indexed by the 'index' column (the original row index).
    """
    prev = dg['end_dt'].shift()  # the end date of the previous row
    # compute the actual start date: the later of start_dt and the previous end
    start = np.where(prev.isna() | (prev < dg['start_dt']),
                     dg['start_dt'], prev)
    # and return the Series
    return (dg.set_index('index')['end_dt'] - start).rename('diff')

df['diff'] = df.reset_index().groupby('id').apply(make_diff).reset_index(level=0)['diff']
For your example dataframe, it gives, as expected:
id start_dt end_dt diff
0 1 2021-01-01 2021-01-14 13 days
1 1 2021-01-01 2021-01-15 1 days
2 1 2021-01-13 2021-01-16 1 days
3 2 2021-02-01 2021-02-04 3 days
4 2 2021-02-01 2021-02-07 3 days
The only requirement is that the rows within each group are correctly sorted.
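The same idea can also be sketched without apply, which sidesteps the index bookkeeping. This is only a minimal sketch: it assumes the corrected February end dates shown in the output above, and the explicit sort_values call is an addition to make the ordering requirement visible.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "start_dt": pd.to_datetime(
        ["2021-01-01", "2021-01-01", "2021-01-13", "2021-02-01", "2021-02-01"]),
    "end_dt": pd.to_datetime(
        ["2021-01-14", "2021-01-15", "2021-01-16", "2021-02-04", "2021-02-07"]),
})

# The computation depends on row order, so sort inside each id first.
df = df.sort_values(["id", "start_dt", "end_dt"]).reset_index(drop=True)

# End date of the previous row within the same id (NaT for the first row of a group).
prev_end = df.groupby("id")["end_dt"].shift()

# Keep start_dt when there is no previous row or no overlap; otherwise use the previous end.
effective_start = df["start_dt"].where(
    prev_end.isna() | (prev_end < df["start_dt"]), prev_end)

df["diff"] = (df["end_dt"] - effective_start).dt.days
print(df)
```

Because groupby(...).shift() returns a Series aligned with the original rows, the result can be assigned directly as a column.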
We first prepare the dataframe, for reproducibility (as far as I can tell, your dataframe is inconsistent)
import numpy as np
import pandas as pd
from io import StringIO
from datetime import datetime
data = StringIO(
"""
id start_dt end_dt
1 2021-01-01 2021-01-14
1 2021-01-01 2021-01-15
1 2021-01-13 2021-01-16
2 2021-02-01 2021-02-04
2 2021-02-01 2021-02-07
""")
# sep=r'\s+' splits on whitespace (delim_whitespace is deprecated in recent pandas)
df = pd.read_csv(data, sep=r'\s+')
df['start_dt'] = pd.to_datetime(df['start_dt'])
df['end_dt'] = pd.to_datetime(df['end_dt'])
Next we convert the dates to day numbers, as offsets from some arbitrary base_day
base_day = datetime(day = 1,month = 1,year = 2021) # does not really matter
df['start_day'] = (df['start_dt'] - base_day).dt.days
df['end_day'] = (df['end_dt'] - base_day).dt.days
Next we
- find the rolling maximum of end_day up to the previous row, per group (hence the shift by one row)
- figure out from_day
- take the difference
# running max of end_day over the earlier rows of the same id; shifting inside
# the group keeps values from leaking from one id into the next
df['max_end_day'] = df.groupby('id')['end_day'].transform(lambda s: s.cummax().shift(1))
df['from_day'] = df[['start_day','max_end_day']].max(axis=1)
df['diff'] = df['end_day'] - df['from_day']
We get (for clarity, with all the intermediate calculations)
id start_dt end_dt start_day end_day max_end_day from_day diff
-- ---- ------------------- ------------------- ----------- --------- ------------- ---------- ------
0 1 2021-01-01 00:00:00 2021-01-14 00:00:00 0 13 nan 0 13
1 1 2021-01-01 00:00:00 2021-01-15 00:00:00 0 14 13 13 1
2 1 2021-01-13 00:00:00 2021-01-16 00:00:00 12 15 14 14 1
3 2 2021-02-01 00:00:00 2021-02-04 00:00:00 31 34 nan 31 3
4 2 2021-02-01 00:00:00 2021-02-07 00:00:00 31 37 34 34 3
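Comparing against the running maximum rather than only the previous row's end matters when an earlier record ends later than the one right before the current row. The sketch below illustrates this on hypothetical day numbers (not the data from the question); the clip(lower=0) call is an added assumption so that a record fully contained in earlier ones contributes 0 instead of a negative value.

```python
import pandas as pd

# Hypothetical data: the first interval ends later than the second one.
df = pd.DataFrame({
    "id": [1, 1, 1],
    "start_day": [0, 2, 4],
    "end_day": [10, 5, 12],
})

g = df.groupby("id")["end_day"]
prev_end = g.shift()                                      # end of the immediately preceding row
running_max = g.transform(lambda s: s.cummax().shift())   # latest end over all preceding rows

# DataFrame.max(axis=1) skips NaN, so the first row of a group falls back to start_day.
from_prev = pd.concat([df["start_day"], prev_end], axis=1).max(axis=1)
from_max = pd.concat([df["start_day"], running_max], axis=1).max(axis=1)

# clip(lower=0): an interval swallowed by earlier ones adds no new days.
df["diff_prev_row"] = (df["end_day"] - from_prev).clip(lower=0)
df["diff_running_max"] = (df["end_day"] - from_max).clip(lower=0)
print(df)
```

For the third row, the previous-row version counts from day 5 and reports 7, while the running-maximum version counts from day 10 and reports 2, which is the number of days not already covered by the first interval.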
Try grouping by id, then find the minimum of start_dt and the maximum of end_dt, then create a new column diff, which is the difference in days
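import pandas as pd

ids = [1, 1, 1, 2, 2]  # renamed from id to avoid shadowing the builtin
start_dt = ['2021-01-01', '2021-01-01', '2021-01-13', '2021-02-01', '2021-02-01']
end_dt = ['2021-01-14', '2021-01-15', '2021-01-16', '2021-01-04', '2021-01-07']
df = pd.DataFrame({'id': ids, 'start_dt': start_dt, 'end_dt': end_dt})
# pd.to_datetime instead of astype(np.datetime64), which recent pandas rejects
df['start_dt'] = pd.to_datetime(df['start_dt'])
df['end_dt'] = pd.to_datetime(df['end_dt'])
print(df)
# one row per id: earliest start_dt and latest end_dt; diff is start minus end,
# so it is negative whenever end_dt is later than start_dt
grouped = df.groupby('id').agg({'start_dt': 'min', 'end_dt': 'max'}).assign(diff=lambda d: (d['start_dt'] - d['end_dt']).dt.days)
print(grouped)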
Output:
start_dt end_dt diff
id
1 2021-01-01 2021-01-16 -15
2 2021-02-01 2021-01-07 25
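Note that this aggregation yields one number per id rather than the per-row running difference asked for in the question. If a positive span per id is what is wanted, a minimal sketch could subtract in the other direction. It assumes the February end dates used in the other answers (2021-02-04 and 2021-02-07), which is a guess at what the question intended.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "start_dt": pd.to_datetime(
        ["2021-01-01", "2021-01-01", "2021-01-13", "2021-02-01", "2021-02-01"]),
    # Assumed correction: group 2 ends in February, not January.
    "end_dt": pd.to_datetime(
        ["2021-01-14", "2021-01-15", "2021-01-16", "2021-02-04", "2021-02-07"]),
})

# Earliest start and latest end per id, using named aggregation.
span = df.groupby("id").agg(start_dt=("start_dt", "min"), end_dt=("end_dt", "max"))

# End minus start, so the span is positive when the dates are consistent.
span["diff"] = (span["end_dt"] - span["start_dt"]).dt.days
print(span)
```

For this data the per-id spans (15 and 6 days) equal the sums of the per-row differences from the other answers, which holds here because the records within each id overlap without gaps.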