Python熊猫可以有效地发现数据帧中连续行的时间差



我有一个如下的数据帧:

payeeId     amount       createdAt                          TrxnID
1001        2.30      2021-04-24 01:40:11.156000+00:00     100AA
1001        35        2021-04-24 02:10:11.146000+00:00     100AB
1001        600       2021-04-24 02:12:14.309000+00:00     100AC
1002        100       2021-04-24 02:59:51.127000+00:00     110BD
1003        1900      2021-04-24 04:09:15.113000+00:00     120AC
1003        10        2021-04-24 04:19:40.132000+00:00     120AM

我想添加一个具有以下逻辑的标志:

If for a given 'PayeeId', the difference between two consecutive 'createdAt' is less than 300 seconds, then the flag will be set to 'No Settlement', else 'Approved'

因此生成的数据帧看起来像

payeeId    amount       createdAt                          TrxnID    Flag
1001       2.30      2021-04-24 01:40:11.156000+00:00     100AA    Approved
1001       35        2021-04-24 02:10:11.146000+00:00     100AB    Approved
1001       600       2021-04-24 02:12:14.309000+00:00     100AC    Not Approved
1002       100       2021-04-24 02:59:51.127000+00:00     110BD    Approved
1003       1900      2021-04-24 04:09:15.113000+00:00     120AC    Approved
1003       10        2021-04-24 04:19:40.132000+00:00     120AM    Approved

所以我尝试使用以下代码片段:

gs = df.groupby(['payeeId'])['createdAt']
df['Time_Diff'] = gs.diff().fillna(pd.Timedelta(seconds=0))/pd.Timedelta(seconds=300)
df['Flag'] = np.where(df_sub_count['Time_Diff']>0,'Approved','No Settlement')

但是上面的一个并没有产生想要的结果。我看到payeeID1002的"无结算"。这是不可取的。

我在这里错过了什么。

df['createdAt'] = pd.to_datetime(df.createdAt)
diff_sec=300.
mask= df.groupby('payeeId').createdAt.diff().astype('timedelta64[s]').fillna(diff_sec)
df['Flag'] = np.where(mask >= diff_sec, 'Approved', 'Not Approved') # plz replace 'Not Approved' with 'No Settlement' based on requirement.

输出

payeeId amount  createdAt                     TrxnID    Flag
0   1001    2.3 2021-04-24 01:40:11.156000+00:00    100AA   Approved
1   1001    35.0    2021-04-24 02:10:11.146000+00:00    100AB   Approved
2   1001    600.0   2021-04-24 02:12:14.309000+00:00    100AC   Not Approved
3   1002    100.0   2021-04-24 02:59:51.127000+00:00    110BD   Approved
4   1003    1900.0  2021-04-24 04:09:15.113000+00:00    120AC   Approved
5   1003    10.0    2021-04-24 04:19:40.132000+00:00    120AM   Approved

解释

我们通过payeeId对df进行分组,并以秒为单位获取createdAt之间的差异
由于我们希望每个payeeId的第一次出现都得到批准,因此用300填充na。
然后我们使用np.where根据条件创建Flag列。

最新更新