聚合为最多包含两个元素的列表



给定如下用户表:

user       query
0    a1      orange
1    a1  strawberry
2    a1        pear
3    a2      orange
4    a2  strawberry
5    a2       lemon
6    a3      orange
7    a3      banana
8    a6        meat
9    a7        beer
10   a8       juice

我想按user分组,并聚合为query的列表,如果超过两个,则选择前两个项目,预期结果为

user                        query
0   a1         [orange, strawberry]
1   a2         [orange, strawberry]
2   a3             [orange, banana]
3   a6                       [meat]
4   a7                       [beer]
5   a8                      [juice]

代码低于

df_user = pd.DataFrame( {'user': {0: 'a1', 1: 'a1', 2: 'a1', 3: 'a2', 
4: 'a2', 5: 'a2', 6: 'a3', 7: 'a3', 
8: 'a6', 9: 'a7', 10: 'a8'}, 
'query': {0: 'orange', 1: 'strawberry', 
2: 'pear', 3: 'orange', 4: 'strawberry', 
5: 'lemon', 6: 'orange', 7: 'banana', 
8: 'meat', 9: 'beer', 10: 'juice'}} )
print(df_user.groupby(['user'], as_index=False).agg(list))

我设法得到

user                        query
0   a1   [orange, strawberry, pear]
1   a2  [orange, strawberry, lemon]
2   a3             [orange, banana]
3   a6                       [meat]
4   a7                       [beer]
5   a8                      [juice]

什么是达到预期结果的好方法?

这里有一种方法:

out = df[df.groupby('user').cumcount()<2].groupby('user', as_index=False).agg(list)

输出:

user                 query
0   a1  [orange, strawberry]
1   a2  [orange, strawberry]
2   a3      [orange, banana]
3   a6                [meat]
4   a7                [beer]
5   a8               [juice]
​

您可以使用iloc来分割多达2个项目:

df_user.groupby(['user'], as_index=False).agg(lambda s: s.iloc[:2].to_list())

输出:

user                 query
0   a1  [orange, strawberry]
1   a2  [orange, strawberry]
2   a3      [orange, banana]
3   a6                [meat]
4   a7                [beer]
5   a8               [juice]

您可以使用groupby+nth()按索引(如果存在(从每组中选择元素:

new_df = df.groupby('user').nth([0, 1]).groupby(level=0).agg(list)

输出:

>>> new_df
query
user                      
a1    [orange, strawberry]
a2    [orange, strawberry]
a3        [orange, banana]
a6                  [meat]
a7                  [beer]
a8                 [juice]

请注意,如果您不想键入所有这些数字,list(range(2))将比[0, 1]更动态:(

最新更新