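I have a DataFrame and I'm trying to compute the time difference between two different topics while staying within the same call, without spilling over into a new call (i.e. making sure it never computes the time difference between topics that belong to different calls). Each interaction_id is a separate call.
Here is a sample DataFrame: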
import numpy as np
import pandas as pd
df = pd.DataFrame([[1, 2, 'Cost'], [1, 5.72, np.nan], [1, 8.83, 'Billing'], [1, 12.86, np.nan], [2, 2, 'Cost'], [2, 6.75, np.nan], [2, 8.54, np.nan], [3, 1.5, 'Payments'], [3, 3.65, 'Products']], columns=['interaction_id', 'start_time', 'topic'])
interaction_id start_time topic
1 2 Cost
1 5.72 NaN
1 8.83 Billing
1 12.86 NaN
2 2 Cost
2 6.75 NaN
2 8.54 NaN
3 1.5 Payments
3 3.65 Products
This is the desired output:
df2 = pd.DataFrame([[1, 2, 'Cost', 6.83], [1, 5.72, np.nan, np.nan], [1, 8.83, 'Billing', 4.03], [1, 12.86, np.nan, np.nan], [2, 2, 'Cost', 6.54], [2, 6.75, np.nan, np.nan], [2, 8.54, np.nan, np.nan], [3, 1.5, 'Payments', 2.15], [3, 3.65, 'Products', '...']], columns=['interaction_id', 'start_time', 'topic', 'topic_length'])
interaction_id start_time topic topic_length
1 2 Cost 6.83
1 5.72 NaN NaN
1 8.83 Billing 4.03
1 12.86 NaN NaN
2 2 Cost 6.54
2 6.75 NaN NaN
2 8.54 NaN NaN
3 1.5 Payments 2.15
3 3.65 Products ....
I don't know whether there is a simpler solution, but this approach solves your problem:
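def custom_agg(group):
    # Renumber rows 0..n-1 so row labels can be compared with max_ind.
    group = group.reset_index(drop=True)
    max_ind = group.shape[0] - 1
    current_ind = -1     # row of the topic currently being timed
    current_val = None   # start_time of that row
    for ind, val in group.iterrows():
        # Skip rows without a topic, except the last row of the call,
        # which closes the final topic.
        if pd.isna(val.topic) and ind != max_ind:
            continue
        if current_ind == -1:
            # First topic row in this call: nothing to close yet.
            current_ind = ind
            current_val = val["start_time"]
        else:
            # The previous topic ends where this row starts.
            group.loc[current_ind, "topic_length"] = val["start_time"] - current_val
            current_ind = ind
            current_val = val["start_time"]
    return group

# Sort so rows are in time order within each call, then process each call separately.
df = df.sort_values(by=['interaction_id', 'start_time']).groupby("interaction_id").apply(custom_agg).reset_index(drop=True)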
Output:
interaction_id start_time topic topic_length
0 1 2.00 Cost 6.83
1 1 5.72 NaN NaN
2 1 8.83 Billing 4.03
3 1 12.86 NaN NaN
4 2 2.00 Cost 6.54
5 2 6.75 NaN NaN
6 2 8.54 NaN NaN
7 3 1.50 Payments 2.15
8 3 3.65 Products NaN
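If the row-by-row loop turns out to be slow on a large frame, the same rule can probably be expressed without iterrows. The following is only a sketch under the assumptions visible in the sample data: a topic lasts until the next row that has a topic, or until the last row of the same call, whichever comes first. The helper names is_last, boundary, b and next_start are mine, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 'Cost'], [1, 5.72, np.nan], [1, 8.83, 'Billing'], [1, 12.86, np.nan],
     [2, 2, 'Cost'], [2, 6.75, np.nan], [2, 8.54, np.nan],
     [3, 1.5, 'Payments'], [3, 3.65, 'Products']],
    columns=['interaction_id', 'start_time', 'topic'])

# Put rows in time order within each call and use a clean 0..n-1 index.
df = df.sort_values(['interaction_id', 'start_time']).reset_index(drop=True)

# A "boundary" row either starts a topic or is the last row of its call.
is_last = ~df['interaction_id'].duplicated(keep='last')
boundary = df['topic'].notna() | is_last

# Looking only at boundary rows, fetch the start_time of the next boundary
# in the same call (NaN for the final boundary of each call).
b = df[boundary]
next_start = b.groupby('interaction_id')['start_time'].shift(-1)

# Assignment aligns on the index, so non-boundary rows are left as NaN.
df['topic_length'] = next_start - b['start_time']

print(df)
```

Because shift(-1) is applied per interaction_id, a topic can never be measured against a row from a different call, which is the constraint from the question. The last topic row of a call that has no rows after it (Products here) stays NaN, same as in the loop version.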