假设我们有以下带有输出窗口_num的数据集:
index user1 date different_months org_different_months window_num
1690289 2670088 2006-08-01 243.0 243.0 1
1772121 2717874 2005-12-01 0.0 0.0 1
1772123 2717874 2005-12-01 0.0 0.0 1
1772125 2717874 2005-12-01 0.0 0.0 1
1772130 2717874 2005-12-01 0.0 0.0 1
1772136 2717874 2006-01-01 0.0 0.0 1
1772132 2717874 2006-02-01 0.0 2099.0 1
1772134 2717874 2020-08-27 0.0 0.0 4
1772117 2717874 0.0 0.0 4
1772118 2717874 0.0 0.0 4
1772128 2717874 2019-11-01 300.0 300.0 3
1772127 2717874 2011-11-01 2922.0 2922.0 2
1774815 2719456 2006-09-01 0.0 0.0 2
1774809 2719456 2006-10-01 0.0 1949.0 2
1774821 2719456 2020-05-20 0.0 0.0 7
1774803 2719456 0.0 0.0 7
1774806 2719456 0.0 0.0 7
1774819 2719456 2019-08-29 265.0 265.0 6
1774825 2719456 2014-10-01 384.0 384.0 4
1774812 2719456 2005-07-01 427.0 427.0 1
1774816 2719456 2012-02-01 973.0 973.0 3
1774824 2719456 2015-10-20 1409.0 1409.0 5
用户编号由user1表示。输出是使用different_months和orig_different_months列生成的window_num。different_months列是日期[n]和日期[n+1]之间的差值。
以前,我使用groupby.apply来输出window_num,但当数据集增加时,它变得非常慢。通过在整个数据集上使用移位函数来计算different_months和orig_different_months列,以及在整个数据集中应用排序,代码得到了显著改进,如下所示:
data = data.sort_values(by=['user','ContractInceptionDateClean'], ascending=[True,True])
#data['user1'] =data['user']
data['different_months'] = (abs((data['ContractInceptionDateClean'].shift(-1)-data['ContractInceptionDateClean'] ).dt.days)).fillna(0)
data.different_months[data['different_months'] < 91] =0
data['shift_different_months']=data['different_months'].shift(1)
data['org_different_months']=data['different_months']
data.loc[((data['different_months'] == 0) | (data['shift_different_months'] == 0)),'different_months']=0
data = salesswindow_cal(data,list(data.user.unique()))
我目前正在努力提高速度的代码如下所示:
def salesswindow_cal(data_,users):
temp = pd.DataFrame()
for u in range(0,len(users)):
df=data_[data_['user']==users[u]]
df['different_months'].values[0]= df['org_different_months'].values[0]
df['window_num']=(df['different_months'].diff() != 0).cumsum()
temp= pd.concat([df,temp],axis=0)
return pd.DataFrame(temp)
经验法则是不要在用户之间循环并提取df = data_[data_['user']==user]
。而是执行groupby
:
for u, df in data_.gropuby('user'):
do_some_stuff
另一个问题是不迭代地连接数据
data_out = []
for user, df in data.groupby('user'):
do_some_stuff
data_out.append(sub_data)
out = pd.concat(data_out)
在您的情况下,您可以执行一个函数,groupby().apply()
和panda将为您连接数据。
def group_func(df):
d = df.copy()
d['different_months'].values[0] = d['org_different_months'].value[0]
d['window_num'] = (d['different_months'].diff().ne(0).cumsum()
return d
data.groupby('user').apply(group_func)
更新:
让我们尝试这种矢量化方法,它修改您的数据就地
# update the first `different_months`
mask = ~data['user'].duplicated()
data.loc[mask, 'different_months'] == data.loc[mask, 'orginal_different_months']
groups = data.groupby('user')
data['diff'] = groups['different_months'].diff().ne(0)
data['window_num'] = groups['diff'].cumsum()