使用Python Pandas的数据帧,如何调整显示并对顺序日志行应用计算?



寻求使用Python和Pandas破解以下2个任务的帮助。

下面是输入顺序日志表:

user_id token   action_id   action_timestamp
7       223     1           timestamp1
12      191     1           timestamp2
45      667     2           timestamp3
7       223     3           timestamp4
12      191     2           timestamp5
12      339     1           timestamp6
7       223     2           timestamp7
12      339     2           timestamp8
7       564     1           timestamp9
12      778     1           timestamp10
91      551     1           timestamp11
12      778     4           timestamp12
12      778     5           timestamp13
91      551     5           timestamp14
91      551     3           timestamp15
91      551     2           timestamp16
45      122     1           timestamp17

第一个期望的结果应该只选择那些具有action_id 1和action_id 2的令牌的行,并将它们的时间戳显示为列-每个用户,每个令牌:

user_id token   timestamp_action_id_1   timestamp_action_id_2
7       223     timestamp1              timestamp7
12      191     timestamp2              timestamp5
12      339     timestamp6              timestamp8
91      551     timestamp11             timestamp16

第二个期望的结果是计算所有令牌从action_id 1到action_id 2的平均时间,每个用户:

user_id action_id_1_to_action_id_2_time_delta_average
7       <avg of time delta for token 223>
12      <avg of time delta for tokens 191 and 339>
91      <avg of time delta for token 551>

提前感谢!


更新:下面是实现mozway答案的代码:

df = pd.DataFrame({
'user_id': [7, 12, 45, 7, 12, 12, 7, 12, 7, 12, 91, 12, 12, 91, 91, 91, 45],
'token': [223, 191, 667, 223, 191, 339, 223, 339, 564, 778, 551, 778, 778, 551, 551, 551, 122],
'action_id': [1, 1, 2, 3, 2, 1, 2, 2, 1, 1, 1, 4, 5, 5, 3, 2, 1],
'action_timestamp': [f'timestamp{x}' for x in range(1,18)]
})
# For all columns
df.pivot(index=['user_id', 'token'], columns='action_id', values='action_timestamp').add_prefix('timestamp_action_id_').reset_index().rename_axis(None, axis=1)
# Only for the desired columns 
df2 = df[df['action_id'].isin([1,2])].pivot(index=['user_id', 'token'], columns='action_id', values='action_timestamp').add_prefix('timestamp_action_id_').reset_index().rename_axis(None, axis=1)
df3 = df2[~df2.isnull().any(axis=1)].reset_index(drop=True)
df3
user_id     token   timestamp_action_id_1   timestamp_action_id_2
0   7   223     timestamp1  timestamp7
1   12  191     timestamp2  timestamp5
2   12  339     timestamp6  timestamp8
3   91  551     timestamp11     timestamp16

但是,如果日志表在令牌中对用户进行重复操作,则会出现"索引包含重复条目,不能重塑"错误。下面是带有重复动作的表:

df = pd.DataFrame({
'user_id': [7, 12, 45, 7, 12, 12, 7, 12, 7, 12, 91, 12, 12, 91, 91, 91, 91, 45],
'token': [223, 191, 667, 223, 191, 339, 223, 339, 564, 778, 551, 778, 778, 551, 551, 551, 551, 122],
'action_id': [1, 1, 2, 3, 2, 1, 2, 2, 1, 1, 1, 4, 5, 5, 3, 5, 2, 1],
'action_timestamp': [f'timestamp{x}' for x in range(1,19)]
})

对于第一步pivot,您的数据框架:

df.pivot(index=['user_id', 'token'], columns='action_id', values='action_timestamp').add_prefix('timestamp_action_id_')

第二个过滤器,pivot,计算增量,groupby user_id并取平均值:

(df[df['action_id'].isin([1,2])]
.pivot(index=['user_id', 'token'],
columns='action_id',
values='action_timestamp')
.add_prefix('timestamp_action_id_')
.assign(delta=lambda d:d['timestamp_action_id_2']-d['timestamp_action_id_1'])
.groupby('user_id')['delta'].apply(lambda x: x.dt.total_seconds().mean())
.dropna()
)

输出(使用随机时间戳):

user_id
7     813000.0
12    -29000.0
91   -601000.0

注意:不幸的是,我现在不能测试/运行代码,所以认为它是伪代码

相关内容

  • 没有找到相关文章

最新更新