我正试图想出一种动态方法来检查字符串的存在,并报告一些不同的结果:gone_client、boomerang、new_client。
-
如果我按address_id和my_date分组,并且模式是Verizon、Verizon、Comcast、Comcast,则客户端离开Verizon,转到另一家公司。
-
如果客户从Verizon转到康卡斯特,然后又回到Verizon,这就是一个回旋镖。
-
如果该客户上个月是Verizon的新客户,那么这就是一个新客户,但如果该客户在过去的某个时候是Verizon的,而上个月又是新客户,则这是一个回旋镖。
进口熊猫作为pd将numpy导入为np
# data stored in dictionary details = { 'address_id': [111,111,111,111,111,111,222,222,222,222,222,222,333,333,333,333,333,333,444,444,444,444,444,444,555,555,555,555,555,555,777,777,777], 'my_company':['Comcast','Verizon','Other','Other','Comcast','Comcast','Spectrum','Spectrum','Spectrum','Spectrum','Spectrum','Spectrum','Verizon','Verizon','Verizon','Verizon','Verizon','Verizon','Spectrum','Spectrum','Spectrum','Spectrum','Verizon','Spectrum','Spectrum','Spectrum','Spectrum','Spectrum','Verizon','Other','Verizon','Comcast','Comcast'], 'my_date':['2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28'] } df = pd.DataFrame(details) df
我不能很好地理解逻辑,但我认为它是沿着这些线的。。。但似乎有点不对劲。
def f(ser):
if 'Verizon' not in ser.unique():
return False
if 'Verizon' in ser.unique():
return True
return False
df['gone_client'] = df.groupby('address_id')['my_company'].transform(f)
df.head()
也许是这样的。。。
# substring to be searched
sub ='Verizon'
# creating and passing series to new column
df["Indexes"]= df["my_company"].str.find(sub)
df
不过,我不知道如何处理这些日期,只有检查这些日期,你才能知道一个客户是否已经走了,是否是新客户,或者可能是一些不适合放在桶里的奇怪模式,比如"其他"。预期的结果是这样的(最后一种情况有点像回旋镖,又一次离开了(。我预计不会有很多这样的情况,但肯定会有一些非常独特的情况,所以"其他"对这些记录来说很好。
address_id my_company my_date status
0 111 Comcast 1/24/2022 Left
1 111 Verizon 2/21/2022 Left
2 111 Other 3/28/2022 Left
3 111 Other 4/25/2022 Left
4 111 Comcast 5/23/2022 Left
5 111 Comcast 6/27/2022 Left
6 222 Spectrum 1/24/2022 Never Client
7 222 Spectrum 2/21/2022 Never Client
8 222 Spectrum 3/28/2022 Never Client
9 222 Spectrum 4/25/2022 Never Client
10 222 Spectrum 5/23/2022 Never Client
11 222 Spectrum 6/27/2022 Never Client
12 333 Verizon 1/24/2022 Never Left
13 333 Verizon 2/21/2022 Never Left
14 333 Verizon 3/28/2022 Never Left
15 333 Verizon 4/25/2022 Never Left
16 333 Verizon 5/23/2022 Never Left
17 333 Verizon 6/27/2022 Never Left
18 444 Spectrum 1/24/2022 Left
19 444 Spectrum 2/21/2022 Left
20 444 Spectrum 3/28/2022 Left
21 444 Spectrum 4/25/2022 Left
22 444 Verizon 5/23/2022 Left
23 444 Spectrum 6/27/2022 Left
24 555 Verizon 1/24/2022 Boomerang
25 555 Spectrum 2/21/2022 Boomerang
26 555 Spectrum 3/28/2022 Boomerang
27 555 Spectrum 4/25/2022 Boomerang
28 555 Verizon 5/23/2022 Boomerang
29 555 Verizon 6/27/2022 Boomerang
30 777 Comcast 1/24/2022 New Client
31 777 Comcast 2/21/2022 New Client
32 777 Verizon 1/24/2022 New Client
33 888 Verizon 2/21/2022 Other
34 888 Comcast 3/28/2022 Other
35 888 Verizon 4/25/2022 Other
36 888 Comcast 5/23/2022 Other
37 888 Comcast 6/27/2022 Other
这里有一个包含一些详细逻辑的解决方案,您可以使用它。听起来你不太确定自己的最终逻辑,但希望这能给你足够的机会。
该解决方案基于address_id
对数据帧进行分组。然后,对于每个单独的组,我们可以检查哪个my_company
特征。我们可以使用这个,再加上timedelta
,来确定地址是否在Verizon,而不是Verizon,从未在Verizon,目前在Verizon,并在过去30天内回来,等等。
此答案不是由Verizon赞助的。存在其他蜂窝式提供商
import pandas as pd
import datetime
# data stored in dictionary
details = {
'address_id': [111,111,111,111,111,111,222,222,222,222,222,222,333,333,333,333,333,333,444,444,444,444,444,444,555,555,555,555,555,555,777,777,777],
'my_company':['Comcast','Verizon','Other','Other','Comcast','Comcast','Spectrum','Spectrum','Spectrum','Spectrum','Spectrum','Spectrum','Verizon','Verizon','Verizon','Verizon','Verizon','Verizon','Spectrum','Spectrum','Spectrum','Spectrum','Verizon','Spectrum','Spectrum','Spectrum','Spectrum','Spectrum','Verizon','Other','Verizon','Comcast','Comcast'],
'my_date':['2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28']
}
df = pd.DataFrame(details)
df['my_date'] = pd.to_datetime(df['my_date'])
address_groups = df.groupby(['address_id'])
frame_list = []
current_date = datetime.datetime.now()
for group, frame in address_groups:
# Create a list and set of each company used by a given address-id:
company_list = frame['my_company'].values.tolist()
company_set = set(company_list)
# Exclusively Verizon
if ('Verizon' in company_set) and (len(company_set) == 1):
frame['status'] = 'Verizon Diehard'
# Never Verizon
if ('Verizon' not in company_set):
frame['status'] = 'Verizon Never'
# Verizon at some point but not currently
if ('Verizon' in company_set) and (company_list[-1] != 'Verizon'):
v_frame = frame[frame['my_company'] == 'Verizon']
last_verizon_date = v_frame['my_date'].iloc[-1]
last_verizon_date = datetime.datetime.strptime(last_verizon_date, '%Y-%m-%d')
if (current_date - last_verizon_date) < pd.Timedelta("30 days"):
frame['status'] = 'Not curretly Verizon, but was in last 30 days'
else:
frame['status'] = 'Not curretly Verizon, but was so more than 30 days ago'
# Verizon currently but was a boomerang
if (company_list[-1] == 'Verizon') and (len(company_set) >= 2):
non_v_frame = frame[frame['my_company'] != 'Verizon']
last_non_v_date = non_v_frame['my_date'].iloc[-1]
last_non_v_date = datetime.datetime.strptime(last_non_v_date, '%Y-%m-%d')
if (current_date - last_non_v_date) < pd.Timedelta("30 days"):
frame['status'] = 'Boomerang back to Verizon in last 30 days'
else:
frame['status'] = 'Boomerang back more than 30 days ago'
frame_list.append(frame)
final_df = pd.concat(frame_list)