寻求使用Python和Pandas破解以下2个任务的帮助。
下面是输入顺序日志表:
user_id token action_id action_timestamp
7 223 1 timestamp1
12 191 1 timestamp2
45 667 2 timestamp3
7 223 3 timestamp4
12 191 2 timestamp5
12 339 1 timestamp6
7 223 2 timestamp7
12 339 2 timestamp8
7 564 1 timestamp9
12 778 1 timestamp10
91 551 1 timestamp11
12 778 4 timestamp12
12 778 5 timestamp13
91 551 5 timestamp14
91 551 3 timestamp15
91 551 2 timestamp16
45 122 1 timestamp17
第一个期望的结果应该只选择那些具有action_id 1和action_id 2的令牌的行,并将它们的时间戳显示为列-每个用户,每个令牌:
user_id token timestamp_action_id_1 timestamp_action_id_2
7 223 timestamp1 timestamp7
12 191 timestamp2 timestamp5
12 339 timestamp6 timestamp8
91 551 timestamp11 timestamp16
第二个期望的结果是计算所有令牌从action_id 1到action_id 2的平均时间,每个用户:
user_id action_id_1_to_action_id_2_time_delta_average
7 <avg of time delta for token 223>
12 <avg of time delta for tokens 191 and 339>
91 <avg of time delta for token 551>
提前感谢!
更新:下面是实现mozway答案的代码:
df = pd.DataFrame({
'user_id': [7, 12, 45, 7, 12, 12, 7, 12, 7, 12, 91, 12, 12, 91, 91, 91, 45],
'token': [223, 191, 667, 223, 191, 339, 223, 339, 564, 778, 551, 778, 778, 551, 551, 551, 122],
'action_id': [1, 1, 2, 3, 2, 1, 2, 2, 1, 1, 1, 4, 5, 5, 3, 2, 1],
'action_timestamp': [f'timestamp{x}' for x in range(1,18)]
})
# For all columns
df.pivot(index=['user_id', 'token'], columns='action_id', values='action_timestamp').add_prefix('timestamp_action_id_').reset_index().rename_axis(None, axis=1)
# Only for the desired columns
df2 = df[df['action_id'].isin([1,2])].pivot(index=['user_id', 'token'], columns='action_id', values='action_timestamp').add_prefix('timestamp_action_id_').reset_index().rename_axis(None, axis=1)
df3 = df2[~df2.isnull().any(axis=1)].reset_index(drop=True)
df3
user_id token timestamp_action_id_1 timestamp_action_id_2
0 7 223 timestamp1 timestamp7
1 12 191 timestamp2 timestamp5
2 12 339 timestamp6 timestamp8
3 91 551 timestamp11 timestamp16
但是,如果日志表在令牌中对用户进行重复操作,则会出现"索引包含重复条目,不能重塑"错误。下面是带有重复动作的表:
df = pd.DataFrame({
'user_id': [7, 12, 45, 7, 12, 12, 7, 12, 7, 12, 91, 12, 12, 91, 91, 91, 91, 45],
'token': [223, 191, 667, 223, 191, 339, 223, 339, 564, 778, 551, 778, 778, 551, 551, 551, 551, 122],
'action_id': [1, 1, 2, 3, 2, 1, 2, 2, 1, 1, 1, 4, 5, 5, 3, 5, 2, 1],
'action_timestamp': [f'timestamp{x}' for x in range(1,19)]
})
对于第一步pivot
,您的数据框架:
df.pivot(index=['user_id', 'token'], columns='action_id', values='action_timestamp').add_prefix('timestamp_action_id_')
第二个过滤器,pivot,计算增量,groupby user_id并取平均值:
(df[df['action_id'].isin([1,2])]
.pivot(index=['user_id', 'token'],
columns='action_id',
values='action_timestamp')
.add_prefix('timestamp_action_id_')
.assign(delta=lambda d:d['timestamp_action_id_2']-d['timestamp_action_id_1'])
.groupby('user_id')['delta'].apply(lambda x: x.dt.total_seconds().mean())
.dropna()
)
输出(使用随机时间戳):
user_id
7 813000.0
12 -29000.0
91 -601000.0
注意:不幸的是,我现在不能测试/运行代码,所以认为它是伪代码