在一个时间框架内根据ID和Date组合数据帧



我有两个数据帧:

email_date_df:

email_id |customer_id| email_date  | email_opened
001   |  1000     | 03-02-21    |       1
002   |  1001     | 03-22-21    |       0
003   |  1002     | 04-02-21    |       1
004   |  1003     | 05-02-21    |       1

transaction_df:

trans_id |customer_id| trans_date  | amount
001   |  1000     | 03-04-21    |   $10
002   |  1001     | 04-30-21    |   $24
003   |  1001     | 05-02-21    |   $14
004   |  1003     | 04-10-21    |   $149

我想了解发送给客户的每封电子邮件,是否在30天内发生了交易。我合并了基于customer_id的日期框架,但有太多重复的行和数据。

是否有一种方法,我可以通过transaction_df搜索email_date_df中的每一行,看看是否有30天内的交易?

输出如下:

email_date_df:

email_id |customer_id| email_date  | email_opened  | transaction_witin_30_days
001   |  1000     | 03-02-21    |       1       |       1
002   |  1001     | 03-22-21    |       0       |       0
003   |  1002     | 04-02-21    |       1       |       0
004   |  1003     | 05-02-21    |       1       |       0

我认为你接近做一个基于customer_id的合并。

首先,对两个dataframe进行左合并。对于客户收到的每封电子邮件,将有一行记录他们记录的每笔交易:

merge = email_date_df.merge(transaction_df, on='customer_id', how='left')

第二,找出所有电子邮件日期和交易日期之间的差异。还要制作一个二进制列,指示差异是否在30天内:

merge['diff'] = (merge['trans_date'] - merge['email_date'])
merge['within30'] = (merge['diff'] < '30D') & (merge['diff'] > '0D')

然后,对于每个唯一的电子邮件(email_id),检查在30天内合并的条目是否有任何差异。对组执行以下操作:

grouped = merge.groupby(['email_id'])['within30'].any().astype(int)

最后,将这些分组值映射回原始数据:

email_date_df['transaction_within_30_days'] = email_date_df['email_id'].map(grouped)

结果如下:

email_id  customer_id email_date  email_opened  transaction_within_30_days
0         1         1000 2021-03-02             1                           1
1         2         1001 2021-03-22             0                           0
2         3         1002 2021-04-02             1                           0
3         4         1003 2021-05-02             1                           0

完整代码:

merge = email_date_df.merge(transaction_df, on='customer_id', how='left')
merge['diff'] = (merge['trans_date'] - merge['email_date'])
merge['within30'] = (merge['diff'] < '30D') & (merge['diff'] > '0D')
grouped = merge.groupby(['email_id'])['within30'].any().astype(int)
email_date_df['transaction_within_30_days'] = email_date_df['email_id'].map(grouped)

的一些变量检查澄清。

>>> merge
email_id  customer_id email_date  ...  amount     diff within30
0         1         1000 2021-03-02  ...    10.0   2 days     True
1         2         1001 2021-03-22  ...    24.0  39 days    False
2         2         1001 2021-03-22  ...    14.0  41 days    False
3         3         1002 2021-04-02  ...     NaN      NaT    False
4         4         1003 2021-05-02  ...   149.0 -22 days    False
[5 rows x 9 columns]
>>> grouped
email_id
1    1
2    0
3    0
4    0
Name: within30, dtype: int64