根据另一列(pandas)中的值计算单列中的天数差



我有一个熊猫df(称为df2),像这样:

id | orderdate |
___________________
123|2020-11-01 |
123|2020-08-01 |
233|2020-07-01 |
233|2020-11-04 |
444|2020-11-04 |
444|2020-05-03 |
444|2020-04-01 |
444|2020-11-25 |

orderdate的值是datetime,格式为'%Y%m%d'。它们代表客户的订单。我想计算每个id(每个客户端)的第一阶和第二阶之间的增量时间。

我想出了:

for i in list(set(df2.id)):
list_sorted=list(set((df2.loc[df2['id']==i, 'orderdate'] ))) 
list_sorted= sorted(list_sorted) #get sorted list of the order dates in ascending order
min_list= list_sorted[0] # first element is first order

df2.loc[df2['id']==i, 'First Order']= min_list
if len(list_sorted)>1:
penultimate_list= list_sorted[1]
df2.loc[df2['id']==i, 'Second Order']= penultimate_list # second element is second order

df2.loc[df2['id']==i, 'Delta orders']= min_list - penultimate_list #calculate delta



else:
df2.loc[df2['id_user']==i, 'Delta orders']= None

我的预期结果是:

id | orderdate | First Order | Second Order| Delta Orders
______________________________________________
123|2020-11-01 |2020-08-01   | 2020-11-01  | 92 days
123|2020-08-01 |2020-08-01   | 2020-11-01  | 92 days
233|2020-07-01 |2020-07-01   | 2020-11-04  | 126 days
233|2020-11-04 |2020-07-01   | 2020-11-04  | 126 days
444|2020-11-04 |2020-04-01   | 2020-05-03  | 32 days
444|2020-05-03 |2020-04-01   | 2020-05-03  | 32 days
444|2020-04-01 |2020-04-01   | 2020-05-03  | 32 days
444|2020-11-25 |2020-04-01   | 2020-05-03  | 32 days

它可以工作,但我觉得它很麻烦。有更简单的方法吗?

与您想要的略有不同,但这是一个开始:

import pandas as pd
from io import StringIO
data = StringIO(
"""id|orderdate
123|2020-11-01 
123|2020-08-01 
233|2020-07-01 
233|2020-11-04 
444|2020-11-04 
444|2020-05-03 
444|2020-04-01 
444|2020-11-25 """)
df = pd.read_csv(data, sep='|')
df['orderdate'] = pd.to_datetime(df['orderdate'], infer_datetime_format=True)
df = df.sort_values(['id', 'orderdate'], ascending=False)
def date_diff(df):
df['order_time_diff'] = (df['orderdate'] - df['orderdate'].shift(-1)).dt.days
df = df.dropna()
return df
# this calculates all order differences
df.groupby('id').apply(date_diff)
# this will get the data as requested
df.groupby('id', as_index=False).apply(date_diff).groupby('id').tail(1)

最新更新