How to find and aggregate observations that supersede other observations in Python



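Context:

  • A sales partner registers an opportunity (OppID); if it is not closed, the opportunity expires after 180 days.
  • When it expires, the same sales partner registers a new OppID, which is really a continuation of the earlier OppID.
  • A manual free-text field (Notes) holds a reference to the earlier OppID, in a string such as "Refers to A123".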
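The reference stored in Notes could be pulled out with a regular expression. This is only a minimal sketch: it assumes every OppID is one uppercase letter followed by three digits (as in the sample data) and that the keyword is either "Refers to" or "Replaces". `extract_ref` and `REF_PATTERN` are hypothetical names used for illustration.

```python
import re

# Assumption: an OppID looks like "A123" (one uppercase letter, three digits)
# and is introduced by "Refers to" or "Replaces".
REF_PATTERN = re.compile(r"(?:Refers to|Replaces)\s+([A-Z]\d{3})\b")


def extract_ref(note):
    """Return the OppID referenced in a Notes string, or None if there is none."""
    match = REF_PATTERN.search(note or "")
    return match.group(1) if match else None


print(extract_ref("Refers to A123"))
print(extract_ref(""))
```

Real Notes entries are typed by hand, so the pattern would probably need to be loosened (case, extra words, typos) before being used on production data.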

Sample dataset (R):

df <- data.frame(
  OppID = c("A123", "A765", "B456", "C137", "F879", "H987"),
  OppDate = c("1/20/2020", "1/21/2020", "7/21/2020", "1/4/2020", "11/1/2020", "8/21/2020"),
  OppStatus = c("Closed", "Closed", "Open", "Closed", "Open", "Open"),
  Notes = c("", "", "Refers to A123", "", "Refers to B456", "Refers to A765")
)

This is the resulting data frame (df):

head(df)
  OppID OppDate   OppStatus Notes
1 A123  1/20/2020 Closed
2 A765  1/21/2020 Closed
3 B456  7/21/2020 Open      Refers to A123
4 C137  1/4/2020  Closed
5 F879  11/1/2020 Open      Refers to B456
6 H987  8/21/2020 Open      Refers to A765

What I need to produce programmatically is something like this (a new data frame, "df2"):

head(df2)
  OppID OppDate   OppStatus Notes               FirstOppDate
1 C137  1/4/2020  Closed
2 F879  11/1/2020 Open      Refers to A123,B456 1/20/2020
3 H987  8/21/2020 Open      Refers to A765      1/21/2020

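As you can see in df2, OppIDs A123 and B456 should be rolled up into OppID F879 (because F879 is a continuation of B456, which in turn is a continuation of A123). A new column should record the OppDate of the oldest OppID in the chain (here A123, dated 1/20/2020).

The same applies to H987 (a continuation of A765). Finally, C137 is unchanged because it is not a continuation of any earlier OppID.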
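The roll-up described above can be traced by hand with plain dictionaries. This is a minimal sketch, assuming every reference points at an OppID that exists in the data; `refers_to`, `opp_date` and `predecessors` are illustrative names, not part of the original data.

```python
# Each OppID maps to the OppID it continues (None if it continues nothing).
refers_to = {"A123": None, "A765": None, "B456": "A123",
             "C137": None, "F879": "B456", "H987": "A765"}
opp_date = {"A123": "1/20/2020", "A765": "1/21/2020", "B456": "7/21/2020",
            "C137": "1/4/2020", "F879": "11/1/2020", "H987": "8/21/2020"}


def predecessors(opp_id):
    """Walk back through the chain and list the earlier OppIDs, oldest first."""
    chain = []
    current = refers_to.get(opp_id)
    # Stop at the start of the chain, or if a reference loops back on itself.
    while current is not None and current not in chain:
        chain.append(current)
        current = refers_to.get(current)
    return chain[::-1]


# An OppID is "latest" when no other OppID refers to it.
superseded = {ref for ref in refers_to.values() if ref is not None}
latest = [opp for opp in refers_to if opp not in superseded]

for opp in latest:
    chain = predecessors(opp)
    first_date = opp_date[chain[0]] if chain else None
    print(opp, chain, first_date)
```

For F879 the walk visits B456 and then A123, so the oldest date in its chain is the OppDate of A123.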

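I have tried to work out a way to do this, without success so far. I know how to extract the OppID from the free-text field, but I cannot see how to check these relationships and aggregate them into the latest OppID.

Any ideas? I hope what I am trying to achieve makes sense (I am not a native English speaker). Thanks a lot!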

import pandas as pd

columns = ['OppID', 'OppDate', 'OppStatus', 'Notes']
rows = [['A123', '1/20/2020', 'Closed', ''],
        ['A765', '1/21/2020', 'Closed', ''],
        ['B456', '7/21/2020', 'Open', 'Refers to A123'],
        ['C137', '1/4/2020', 'Closed', ''],
        ['F879', '11/1/2020', 'Open', 'Refers to B456'],
        ['H987', '8/21/2020', 'Open', 'Refers to A765']]
df = pd.DataFrame(rows, columns=columns)

# Take the last word of Notes as the referenced OppID.
# A regular expression may suit your real data better.
df['ref_opp_id'] = [x.split()[-1] if len(x) > 0 else None for x in df['Notes']]

# Follow each row's chain of references back to the oldest OppID.
# This can be parallelized or further optimized.
total_ref_opps = []
first_opp_dates = []
for index, row in df.iterrows():
    total = []
    final_opp_id = row['ref_opp_id']
    first_opp_date = None
    # pd.notna also handles a missing reference stored as NaN instead of None
    while pd.notna(final_opp_id):
        total.append(final_opp_id)
        first_opp_date = df[df['OppID'] == final_opp_id]['OppDate'].values[0]
        final_opp_id = df[df['OppID'] == final_opp_id]['ref_opp_id'].values[0]
    total_ref_opps.append(total)
    first_opp_dates.append(first_opp_date)
df['total_ref_opps'] = total_ref_opps
df['first_opp_dates'] = first_opp_dates

# Keep only the rows that no other row refers to (the latest OppIDs).
all_dup_items = [item for sublist in total_ref_opps for item in sublist]
df_new = df[~df['OppID'].isin(all_dup_items)].copy().reset_index(drop=True)
df_new.head()
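If the row-by-row lookups turn out to be slow on a larger table, one possible alternative is to resolve the oldest OppID of every chain with repeated `Series.map` calls. This is a sketch under assumptions: OppIDs are unique, each one is an uppercase letter plus three digits, and every reference points at an existing row. `parent`, `root`, `FirstOppID` and `latest` are illustrative names, and the sketch does not build the aggregated Notes string.

```python
import pandas as pd

df = pd.DataFrame({
    "OppID": ["A123", "A765", "B456", "C137", "F879", "H987"],
    "OppDate": ["1/20/2020", "1/21/2020", "7/21/2020",
                "1/4/2020", "11/1/2020", "8/21/2020"],
    "OppStatus": ["Closed", "Closed", "Open", "Closed", "Open", "Open"],
    "Notes": ["", "", "Refers to A123", "", "Refers to B456", "Refers to A765"],
})

# Assumption: the referenced OppID is one uppercase letter plus three digits.
ref = df["Notes"].str.extract(r"([A-Z]\d{3})", expand=False)

# parent[opp] is the OppID it continues, or itself when it continues nothing.
parent = pd.Series(ref.fillna(df["OppID"]).to_numpy(), index=df["OppID"].to_numpy())
dates = pd.Series(df["OppDate"].to_numpy(), index=df["OppID"].to_numpy())

# Step every entry one link back until nothing changes.
# A chain cannot be longer than the table, which also bounds accidental cycles.
root = parent.copy()
for _ in range(len(df)):
    nxt = root.map(parent)
    if nxt.equals(root):
        break
    root = nxt

df["FirstOppID"] = root.to_numpy()
df["FirstOppDate"] = root.map(dates).to_numpy()
# Rows that continue nothing get no FirstOppDate, as in the desired df2.
df.loc[ref.isna(), "FirstOppDate"] = None

# Keep only the rows that no other row refers to.
latest = df[~df["OppID"].isin(ref.dropna())].reset_index(drop=True)
print(latest[["OppID", "FirstOppID", "FirstOppDate"]])
```

Each pass moves every row one link closer to the start of its chain, so the number of passes grows with the longest chain rather than with the number of rows.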

I am a new contributor. If this is what you were looking for, please mark this answer as accepted.
