I have a dataframe with the following columns:
df1:
Task Start Date | Task Finish Date | Task Role
01-01-2021 01-03-2021 Lead
01-04-2021 02-02-2021 Team member
01-04-2021 02-23-2021 Unknown
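I want to create another column, 'Origin Role', based on the following conditions:
If the Task Role is Lead or anything other than Unknown, skip it (copy the same Task Role into the new column 'Origin Role');
else, only when the Task Role is Unknown:
Take the Task Start Date of the record whose Task Role is "Unknown", and find the most recent record (if there are several) whose Task Finish Date is <= (less than or equal to) that Task Start Date.
Then fill the new column 'Origin Role' with the Task Role of that most recent record, in place of Unknown.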
My expected output:
df1:
Task Start Date | Task Finish Date | Task Role | Origin Role
01-01-2021 01-03-2021 Lead Lead
01-04-2021 02-02-2021 Team member Team member
01-04-2021 02-23-2021 Unknown Lead
One option is to use merge_asof:
import numpy as np
import pandas as pd

# ensure datetime (dates are month-first, e.g. 02-23-2021)
date_cols = ['Task Start Date', 'Task Finish Date']
df[date_cols] = df[date_cols].apply(pd.to_datetime, dayfirst=False)

# keep known roles; fill each "Unknown" from the latest known row
# whose finish date is <= that row's start date
df['Origin Role'] = df['Task Role'].replace({'Unknown': np.nan}).fillna(
    pd.merge_asof(df['Task Start Date'].sort_values().reset_index(),
                  df[['Task Finish Date', 'Task Role']]
                    .loc[lambda d: d['Task Role'].ne('Unknown')]
                    .sort_values(by='Task Finish Date'),
                  left_on='Task Start Date', right_on='Task Finish Date',
                  ).set_index('index')['Task Role']
)
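Steps:
- replace "Unknown" with NaN
- sort by date
- merge on the latest date that is <= the start date, ignoring "Unknown" rows
- fillna with this value

Output: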
Task Start Date Task Finish Date Task Role Origin Role
0 2021-01-01 2021-01-03 Lead Lead
1 2021-01-04 2021-02-02 Team member Team member
2 2021-01-04 2021-02-23 Unknown Lead
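
For readers who want to inspect the intermediate results, the same steps can be unrolled into separate statements. This is only a sketch built on the three sample rows from the question. The names `known_role`, `left`, `right`, `matched` and `candidate` are illustrative and not part of the original answer, and it uses `Series.mask` and an explicit `format=` where the one-liner uses `replace` and `dayfirst=False`.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (dates are month-day-year).
df = pd.DataFrame({
    'Task Start Date': ['01-01-2021', '01-04-2021', '01-04-2021'],
    'Task Finish Date': ['01-03-2021', '02-02-2021', '02-23-2021'],
    'Task Role': ['Lead', 'Team member', 'Unknown'],
})
date_cols = ['Task Start Date', 'Task Finish Date']
df[date_cols] = df[date_cols].apply(pd.to_datetime, format='%m-%d-%Y')

# Step 1: turn "Unknown" into NaN so it can be filled later.
known_role = df['Task Role'].mask(df['Task Role'].eq('Unknown'))

# Step 2: sort both sides by their merge key (merge_asof requires sorted keys).
# reset_index() keeps the original row labels in a column named 'index'.
left = df['Task Start Date'].sort_values().reset_index()
right = (df.loc[df['Task Role'].ne('Unknown'), ['Task Finish Date', 'Task Role']]
           .sort_values(by='Task Finish Date'))

# Step 3: for each start date, take the last known row whose
# finish date is <= that start date (a backward as-of match).
matched = pd.merge_asof(left, right,
                        left_on='Task Start Date', right_on='Task Finish Date',
                        direction='backward')
candidate = matched.set_index('index')['Task Role']

# Step 4: use the candidate only where the role was "Unknown".
df['Origin Role'] = known_role.fillna(candidate)
print(df)
```

Note that `candidate` is NaN for row 0, because no known task finishes on or before 2021-01-01. An "Unknown" row in that position would stay NaN in 'Origin Role'.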