我有一个如下的数据帧:
payeeId amount createdAt TrxnID
1001 2.30 2021-04-24 01:40:11.156000+00:00 100AA
1001 35 2021-04-24 02:10:11.146000+00:00 100AB
1001 600 2021-04-24 02:12:14.309000+00:00 100AC
1002 100 2021-04-24 02:59:51.127000+00:00 110BD
1003 1900 2021-04-24 04:09:15.113000+00:00 120AC
1003 10 2021-04-24 04:19:40.132000+00:00 120AM
我想添加一个具有以下逻辑的标志:
If for a given 'PayeeId', the difference between two consecutive 'createdAt' is less than 300 seconds, then the flag will be set to 'No Settlement', else 'Approved'
因此生成的数据帧看起来像
payeeId amount createdAt TrxnID Flag
1001 2.30 2021-04-24 01:40:11.156000+00:00 100AA Approved
1001 35 2021-04-24 02:10:11.146000+00:00 100AB Approved
1001 600 2021-04-24 02:12:14.309000+00:00 100AC Not Approved
1002 100 2021-04-24 02:59:51.127000+00:00 110BD Approved
1003 1900 2021-04-24 04:09:15.113000+00:00 120AC Approved
1003 10 2021-04-24 04:19:40.132000+00:00 120AM Approved
所以我尝试使用以下代码片段:
gs = df.groupby(['payeeId'])['createdAt']
df['Time_Diff'] = gs.diff().fillna(pd.Timedelta(seconds=0))/pd.Timedelta(seconds=300)
df['Flag'] = np.where(df_sub_count['Time_Diff']>0,'Approved','No Settlement')
但是上面的一个并没有产生想要的结果。我看到payeeID
1002的"无结算"。这是不可取的。
我在这里错过了什么。
df['createdAt'] = pd.to_datetime(df.createdAt)
diff_sec=300.
mask= df.groupby('payeeId').createdAt.diff().astype('timedelta64[s]').fillna(diff_sec)
df['Flag'] = np.where(mask >= diff_sec, 'Approved', 'Not Approved') # plz replace 'Not Approved' with 'No Settlement' based on requirement.
输出
payeeId amount createdAt TrxnID Flag
0 1001 2.3 2021-04-24 01:40:11.156000+00:00 100AA Approved
1 1001 35.0 2021-04-24 02:10:11.146000+00:00 100AB Approved
2 1001 600.0 2021-04-24 02:12:14.309000+00:00 100AC Not Approved
3 1002 100.0 2021-04-24 02:59:51.127000+00:00 110BD Approved
4 1003 1900.0 2021-04-24 04:09:15.113000+00:00 120AC Approved
5 1003 10.0 2021-04-24 04:19:40.132000+00:00 120AM Approved
解释。
我们通过payeeId
对df进行分组,并以秒为单位获取createdAt
之间的差异
由于我们希望每个payeeId
的第一次出现都得到批准,因此用300填充na。
然后我们使用np.where
根据条件创建Flag
列。