Calculating date differences with nulls in Pandas



I have a DataFrame with several columns. I have grouped it by "Name" and sorted it by "Date1". A subset of the dataset looks like this:

import numpy as np
import pandas as pd
data_ = {
    'Name': ['Alfie', 'Alfie', 'Alfie', 'Alfie', 'George', 'George', 'Radar', 'Radar', 'Radar', 'Radar'],
    'Date1': ['2009-03-26', '2014-12-23', '2016-03-18', '2021-08-31', '2014-01-23', '2020-02-20', '2008-05-24', '2010-06-22', '2015-09-10', '2017-03-11'],
    'Date2': [np.nan, '2016-03-16', np.nan, np.nan, np.nan, np.nan, '2010-03-23', np.nan, '2016-09-12', np.nan],
    'Difference': [0, 0, '2 days', 0, 0, 0, 0, '91 days', 0, '181 days'],
}
# Create DataFrame.
df_test = pd.DataFrame(data_)
df_test["Date1"] = pd.to_datetime(df_test["Date1"], format="%Y-%m-%d")
df_test["Date2"] = pd.to_datetime(df_test["Date2"], format="%Y-%m-%d")

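I want to compute the values in the "Difference" column. Within each "Name" group, it is the difference between the next "Date1" value and the preceding non-null "Date2". If "Date2" is null, Difference can be set to 0. How can I compute this in Pandas? Should it be something like a groupby function that checks each row in the group and then compares the dates in the two columns?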

df_test['Difference'] = df_test['Date1'].sub(df_test.groupby('Name')['Date2'].ffill(limit=1)).dt.days
df_test['Difference'] = np.where(df_test['Date2'].notna(), np.nan, df_test['Difference'])

Output:

     Name      Date1      Date2  Difference
0   Alfie 2009-03-26        NaT         NaN
1   Alfie 2014-12-23 2016-03-16         NaN
2   Alfie 2016-03-18        NaT         2.0
3   Alfie 2021-08-31        NaT         NaN
4  George 2014-01-23        NaT         NaN
5  George 2020-02-20        NaT         NaN
6   Radar 2008-05-24 2010-03-23         NaN
7   Radar 2010-06-22        NaT        91.0
8   Radar 2015-09-10 2016-09-12         NaN
9   Radar 2017-03-11        NaT       180.0

Here is my attempt. It's a bit ugly, but I think it works (except that the last difference comes out as 180 days rather than 181):

import pandas as pd
import numpy as np

data_ = {
    'Name': ['Alfie', 'Alfie', 'Alfie', 'Alfie', 'George', 'George', 'Radar', 'Radar', 'Radar', 'Radar'],
    'Date1': ['2009-03-26', '2014-12-23', '2016-03-18', '2021-08-31', '2014-01-23', '2020-02-20', '2008-05-24', '2010-06-22', '2015-09-10', '2017-03-11'],
    'Date2': [np.nan, '2016-03-16', np.nan, np.nan, np.nan, np.nan, '2010-03-23', np.nan, '2016-09-12', np.nan],
}
# Create DataFrame.
df_test = pd.DataFrame(data_)
df_test["Date1"] = pd.to_datetime(df_test["Date1"], format="%Y-%m-%d")
df_test["Date2"] = pd.to_datetime(df_test["Date2"], format="%Y-%m-%d")
# The first row has no previous row, so its difference is 0.
difference_list = [0]
rows = list(df_test.iterrows())
for i in range(len(rows) - 1):
    # Subtract row i's Date2 from row i+1's Date1 when both are present.
    if not pd.isnull(rows[i][1]["Date2"]) and not pd.isnull(rows[i + 1][1]["Date1"]):
        difference_list.append((rows[i + 1][1]["Date1"] - rows[i][1]["Date2"]).days)
    else:
        difference_list.append(0)
df_test["Difference"] = difference_list
print(df_test)

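Essentially, all I did was put all the rows into a list and then do the subtraction the way you specified: the i-th Date2 is subtracted from the (i+1)-th Date1 to get the difference. The if and else are only there to handle null datetimes.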
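One caveat with the loop above: it compares consecutive rows without looking at "Name", so if the last row of one group had a non-null Date2, it would be subtracted from the first Date1 of the next group. That does not happen in the sample data, but a minimal sketch of a group-aware variant could look like this. The three-row frame and the names `df`, `current`, `following`, and `same_group` are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Hypothetical frame: the last "A" row has a Date2, followed by a "B" row.
df = pd.DataFrame({
    "Name": ["A", "A", "B"],
    "Date1": pd.to_datetime(["2020-01-01", "2020-01-10", "2020-02-01"]),
    "Date2": pd.to_datetime(["2020-01-05", "2020-01-20", None]),
})

differences = [0]  # the first row has no previous row
for i in range(len(df) - 1):
    current, following = df.iloc[i], df.iloc[i + 1]
    same_group = current["Name"] == following["Name"]
    # Only subtract when both rows belong to the same Name and Date2 is set.
    if same_group and pd.notna(current["Date2"]):
        differences.append((following["Date1"] - current["Date2"]).days)
    else:
        differences.append(0)

df["Difference"] = differences
print(df)
```

Without the `same_group` check, the "B" row would get 12 (2020-02-01 minus 2020-01-20) instead of 0.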

Output:

     Name      Date1      Date2  Difference
0   Alfie 2009-03-26        NaT           0
1   Alfie 2014-12-23 2016-03-16           0
2   Alfie 2016-03-18        NaT           2
3   Alfie 2021-08-31        NaT           0
4  George 2014-01-23        NaT           0
5  George 2020-02-20        NaT           0
6   Radar 2008-05-24 2010-03-23           0
7   Radar 2010-06-22        NaT          91
8   Radar 2015-09-10 2016-09-12           0
9   Radar 2017-03-11        NaT         180
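
On the 180 versus 181 discrepancy: the gap between 2016-09-12 and 2017-03-11 can be checked directly, and it comes out as 180 days, so the 181 in the question's expected column looks like a miscount rather than a problem with the code. A quick check, using only the two dates from the last Radar rows:

```python
import pandas as pd

# Days between Radar's last non-null Date2 and the following Date1.
delta = pd.Timestamp("2017-03-11") - pd.Timestamp("2016-09-12")
print(delta.days)  # 180
```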

Try this:

s = df_test['Date1'].sub(df_test.groupby('Name')['Date2'].shift())
s.where(s.notna(),0)
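
Note that the snippet above leaves `s` as a timedelta Series and does not assign the result of `where` back to the frame. If the goal is an integer "Difference" column with 0 for missing values, one possible way to finish it off is sketched below. The name `prev_date2` and the `.dt.days.fillna(0).astype(int)` step are my additions, not part of the original answer.

```python
import numpy as np
import pandas as pd

df_test = pd.DataFrame({
    "Name": ["Alfie", "Alfie", "Alfie", "Alfie", "George", "George",
             "Radar", "Radar", "Radar", "Radar"],
    "Date1": ["2009-03-26", "2014-12-23", "2016-03-18", "2021-08-31",
              "2014-01-23", "2020-02-20", "2008-05-24", "2010-06-22",
              "2015-09-10", "2017-03-11"],
    "Date2": [np.nan, "2016-03-16", np.nan, np.nan, np.nan, np.nan,
              "2010-03-23", np.nan, "2016-09-12", np.nan],
})
df_test["Date1"] = pd.to_datetime(df_test["Date1"], format="%Y-%m-%d")
df_test["Date2"] = pd.to_datetime(df_test["Date2"], format="%Y-%m-%d")

# Date2 of the previous row within the same Name (NaT at each group start).
prev_date2 = df_test.groupby("Name")["Date2"].shift()
s = df_test["Date1"].sub(prev_date2)

# Convert to whole days, use 0 where there was nothing to subtract.
df_test["Difference"] = s.dt.days.fillna(0).astype(int)
print(df_test)
```

Because the shift happens inside `groupby("Name")`, a Date2 from one name is never compared against a Date1 from another.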
