I have two dataframes:
email_date_df:
email_id |customer_id| email_date | email_opened
001 | 1000 | 03-02-21 | 1
002 | 1001 | 03-22-21 | 0
003 | 1002 | 04-02-21 | 1
004 | 1003 | 05-02-21 | 1
transaction_df:
trans_id |customer_id| trans_date | amount
001 | 1000 | 03-04-21 | $10
002 | 1001 | 04-30-21 | $24
003 | 1001 | 05-02-21 | $14
004 | 1003 | 04-10-21 | $149
For each email sent to a customer, I want to know whether a transaction occurred within 30 days. I merged the dataframes on customer_id, but that produced too many duplicate rows. Is there a way to search transaction_df for each row of email_date_df and check whether a transaction occurred within 30 days of the email?
Desired output:
email_date_df:
email_id |customer_id| email_date | email_opened | transaction_within_30_days
001 | 1000 | 03-02-21 | 1 | 1
002 | 1001 | 03-22-21 | 0 | 0
003 | 1002 | 04-02-21 | 1 | 0
004 | 1003 | 05-02-21 | 1 | 0
I think you were close with a merge on customer_id.

First, do a left merge of the two dataframes. For each email a customer received, there will be one row for every transaction that customer recorded:
merge = email_date_df.merge(transaction_df, on='customer_id', how='left')
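As a setup sketch (assuming the raw columns are strings exactly as shown in the question: MM-DD-YY dates and '$'-prefixed amounts), both date columns need to be parsed to datetime before the subtraction in the next step will work:

```python
import pandas as pd

# Sample data from the question, as raw strings.
email_date_df = pd.DataFrame({
    'email_id': [1, 2, 3, 4],
    'customer_id': [1000, 1001, 1002, 1003],
    'email_date': ['03-02-21', '03-22-21', '04-02-21', '05-02-21'],
    'email_opened': [1, 0, 1, 1],
})
transaction_df = pd.DataFrame({
    'trans_id': [1, 2, 3, 4],
    'customer_id': [1000, 1001, 1001, 1003],
    'trans_date': ['03-04-21', '04-30-21', '05-02-21', '04-10-21'],
    'amount': ['$10', '$24', '$14', '$149'],
})

# Parse MM-DD-YY dates to datetime and strip the currency symbol.
email_date_df['email_date'] = pd.to_datetime(email_date_df['email_date'], format='%m-%d-%y')
transaction_df['trans_date'] = pd.to_datetime(transaction_df['trans_date'], format='%m-%d-%y')
transaction_df['amount'] = transaction_df['amount'].str.lstrip('$').astype(float)

# Left merge: one row per (email, transaction) pair; customer 1002 has
# no transactions, so that email keeps a single row of NaN/NaT values.
merge = email_date_df.merge(transaction_df, on='customer_id', how='left')
```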
Second, compute the difference between each transaction date and email date, and build a boolean column indicating whether that difference falls within 30 days:
merge['diff'] = merge['trans_date'] - merge['email_date']
# Offset strings like '30D' are coerced to Timedelta; NaT diffs compare False
merge['within30'] = (merge['diff'] < '30D') & (merge['diff'] > '0D')
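As an aside, pandas coerces offset strings like '30D' to a Timedelta for these comparisons, and NaT differences (customers with no transactions) compare False, which is exactly what we want here. A quick standalone check with hypothetical gaps:

```python
import pandas as pd

# Hypothetical email-to-transaction gaps: within window, too late, before the email.
diff = pd.Series(pd.to_timedelta(['2 days', '39 days', '-22 days']))

# Same condition as above; the string '30D' is coerced to pd.Timedelta('30D').
within30 = (diff < '30D') & (diff > '0D')
# Only the 2-day gap lands in the open (0, 30)-day window.
```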
Then, for each unique email (email_id), check whether any of its merged rows has a difference within 30 days. Do a groupby:
grouped = merge.groupby(['email_id'])['within30'].any().astype(int)
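A standalone sketch of what .any() does in this step, using hypothetical within30 values shaped like the merged rows (one True row for email 1, two duplicate False rows for email 2):

```python
import pandas as pd

# Hypothetical per-row flags and their email_id grouper, mimicking the merge.
within30 = pd.Series([True, False, False, False, False], name='within30')
email_ids = pd.Series([1, 2, 2, 3, 4], name='email_id')

# .any() collapses duplicate rows per email; astype(int) gives 0/1 flags.
grouped = within30.groupby(email_ids).any().astype(int)
```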
Finally, map these grouped values back onto the original dataframe:
email_date_df['transaction_within_30_days'] = email_date_df['email_id'].map(grouped)
The result:
email_id customer_id email_date email_opened transaction_within_30_days
0 1 1000 2021-03-02 1 1
1 2 1001 2021-03-22 0 0
2 3 1002 2021-04-02 1 0
3 4 1003 2021-05-02 1 0
Full code:
merge = email_date_df.merge(transaction_df, on='customer_id', how='left')
merge['diff'] = (merge['trans_date'] - merge['email_date'])
merge['within30'] = (merge['diff'] < '30D') & (merge['diff'] > '0D')
grouped = merge.groupby(['email_id'])['within30'].any().astype(int)
email_date_df['transaction_within_30_days'] = email_date_df['email_id'].map(grouped)
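For reference, here is the same pipeline run end to end on the question's sample data, assuming the dates have already been parsed to datetime and the amounts to floats:

```python
import pandas as pd

email_date_df = pd.DataFrame({
    'email_id': [1, 2, 3, 4],
    'customer_id': [1000, 1001, 1002, 1003],
    'email_date': pd.to_datetime(['2021-03-02', '2021-03-22', '2021-04-02', '2021-05-02']),
    'email_opened': [1, 0, 1, 1],
})
transaction_df = pd.DataFrame({
    'trans_id': [1, 2, 3, 4],
    'customer_id': [1000, 1001, 1001, 1003],
    'trans_date': pd.to_datetime(['2021-03-04', '2021-04-30', '2021-05-02', '2021-04-10']),
    'amount': [10.0, 24.0, 14.0, 149.0],
})

merge = email_date_df.merge(transaction_df, on='customer_id', how='left')
merge['diff'] = merge['trans_date'] - merge['email_date']
merge['within30'] = (merge['diff'] < '30D') & (merge['diff'] > '0D')
grouped = merge.groupby(['email_id'])['within30'].any().astype(int)
email_date_df['transaction_within_30_days'] = email_date_df['email_id'].map(grouped)
```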
Some intermediate variables, for clarification:
>>> merge
email_id customer_id email_date ... amount diff within30
0 1 1000 2021-03-02 ... 10.0 2 days True
1 2 1001 2021-03-22 ... 24.0 39 days False
2 2 1001 2021-03-22 ... 14.0 41 days False
3 3 1002 2021-04-02 ... NaN NaT False
4 4 1003 2021-05-02 ... 149.0 -22 days False
[5 rows x 9 columns]
>>> grouped
email_id
1 1
2 0
3 0
4 0
Name: within30, dtype: int64