我有一个关于每个客户id的每日交易的大型数据集,示例如下。我试图计算每个客户的使用寿命,即客户第一次购买后的天数。在示例数据的上下文中,我想计算自上次购买以来2018-04-14
的天数:对于id1052
,长度为44。
trans_date 2018-03-01 2018-03-02 2018-03-03 ... 2018-04-12 2018-04-13
id
1052 30.00 0.0 0.0 ... 14.00 0.0
1053 0.0 0.0 10.64 ... 0.0 0.0
1054 0.0 8.40 0.0 ... 0.0 9.10
1055 0.0 0.0 0.0 ... 0.0 0.0
1056 0.0 0.0 0.0 ... 42.21 0.0
我知道如何找到行中的第一个非零值,但使用column_indextrans_date
计算最近性是我的难点。我该怎么做?
检查非零值后,这是idxmax
。我们需要为所有0添加一个额外的检查用例,因为idxmax
会为该行返回df.columns[0]
,这没有意义。在这里,我选择让它使用where
返回NaN
。
import pandas as pd
#df.columns = pd.to_datetime(df.columns)
my_date = pd.to_datetime('2018-04-14')
u = df.ne(0)
(my_date - u.idxmax(1).where(u.any(1))).dt.days
#id
#1052 44.0
#1053 42.0
#1054 43.0
#1055 NaN
#1056 2.0
#dtype: float64
试试这个:
# Convert the column names to dates
dates = pd.to_datetime(df.columns)
def customer_lifetime(row):
# Indices of the days where the customer has made purchase
purchases = np.where(row > 0)[0]
if len(purchases) == 0:
# The customer never bought anything
return 0
else:
# The index of the first purchase date
i = purchases[0]
# The "current" date
j = len(dates) - 1
# The customer lifetime is the difference between the
# current date and first purchase date
return (dates[j] - dates[i]) / pd.Timedelta(days=1)
df['CustomerLifetime'] = df.apply(customer_lifetime, axis=1)
结果:
2018-03-01 2018-03-02 2018-03-03 2018-04-12 2018-04-13 CustomerLifetime
id
1052 30.0 0.0 0.00 14.00 0.0 43.0
1053 0.0 0.0 10.64 0.00 0.0 41.0
1054 0.0 8.4 0.00 0.00 9.1 42.0
1055 0.0 0.0 0.00 0.00 0.0 0.0
1056 0.0 0.0 0.00 42.21 0.0 1.0