根据另一列pandas查找前5个值



在Panda数据帧中对customer_id列进行分组时,如何找到category列的前5个值?

customer_id     email                        address_id  name              category
0         411   NORMAN.CURRIER@sakilacustomer.org       416     NORMAN CURRIER      Scifi
1         411   NORMAN.CURRIER@sakilacustomer.org       416     NORMAN CURRIER      Action
2         411   NORMAN.CURRIER@sakilacustomer.org       416     NORMAN CURRIER      Sports
3         411   NORMAN.CURRIER@sakilacustomer.org       416     NORMAN CURRIER      Scifi
4         411   NORMAN.CURRIER@sakilacustomer.org       416     NORMAN CURRIER      Family
5         411   NORMAN.CURRIER@sakilacustomer.org       416     NORMAN CURRIER      Action
6         527   CORY.MEEHAN@sakilacustomer.org          533     CORY MEEHAN         Documentary
7         527   CORY.MEEHAN@sakilacustomer.org          533     CORY MEEHAN         Action
8         527   CORY.MEEHAN@sakilacustomer.org          533     CORY MEEHAN         Sports
9         527   CORY.MEEHAN@sakilacustomer.org          533     CORY MEEHAN         Scifi
10        527   CORY.MEEHAN@sakilacustomer.org          533     CORY MEEHAN         Documentary
11        527   CORY.MEEHAN@sakilacustomer.org          533     CORY MEEHAN         Sports

对于每个唯一的customer_id,我想要另一个名为preferred_film_category的列(前5个值基于每个唯一的customer_id出现特定类别的次数(

预期数据帧:

customer_id     email                     address_id    name       category      preferred_film_category   
0       411    NORMAN.CURRIER@sakilacustomer.org   416   NORMAN CURRIER    Scifi        Scifi, Action, Sports, Animation, Drama    
1       411    NORMAN.CURRIER@sakilacustomer.org   416   NORMAN CURRIER    Action       Scifi, Action, Sports, Animation, Drama 
2       411    NORMAN.CURRIER@sakilacustomer.org   416   NORMAN CURRIER    Sports       Scifi, Action, Sports, Animation, Drama 
3       411    NORMAN.CURRIER@sakilacustomer.org   416   NORMAN CURRIER    Scifi        Scifi, Action, Sports, Animation, Drama 
4       411    NORMAN.CURRIER@sakilacustomer.org   416   NORMAN CURRIER    Family       Scifi, Action, Sports, Animation, Drama 
5       411    NORMAN.CURRIER@sakilacustomer.org   416   NORMAN CURRIER    Action       Scifi, Action, Sports, Animation, Drama
6       527     CORY.MEEHAN@sakilacustomer.org     533   CORY MEEHAN       Documentary  Documentary, Sports, Scifi, Action 
7       527     CORY.MEEHAN@sakilacustomer.org     533   CORY MEEHAN       Action       Documentary, Sports, Scifi, Action 
8       527     CORY.MEEHAN@sakilacustomer.org     533   CORY MEEHAN       Sports       Documentary, Sports, Scifi, Action 
9       527     CORY.MEEHAN@sakilacustomer.org     533   CORY MEEHAN       Scifi        Documentary, Sports, Scifi, Action 
10      527     CORY.MEEHAN@sakilacustomer.org     533   CORY MEEHAN       Documentary  Documentary, Sports, Scifi, Action 
11      527     CORY.MEEHAN@sakilacustomer.org     533   CORY MEEHAN       Sports       Documentary, Sports, Scifi, Action 

尝试value_counts+groupby nlargest以获得最高频率的类别,然后groupby aggregate转换为字符串,再join与原始DataFrame:合并

n = 2
df = df.join(
df.value_counts(['customer_id', 'category'])
.groupby(level=0).nlargest(n)
.reset_index('category')
.groupby(level=0)['category'].agg(', '.join)
.rename('preferred_film_category'),
on='customer_id'
)

df:

customer_id                              email  address_id            name     category preferred_film_category
0           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER        Scifi           Action, Scifi
1           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER       Action           Action, Scifi
2           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER       Sports           Action, Scifi
3           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER        Scifi           Action, Scifi
4           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER       Family           Action, Scifi
5           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER       Action           Action, Scifi
6           527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN  Documentary     Documentary, Sports
7           527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN       Action     Documentary, Sports
8           527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN       Sports     Documentary, Sports
9           527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN        Scifi     Documentary, Sports
10          527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN  Documentary     Documentary, Sports
11          527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN       Sports     Documentary, Sports

*注意n被设置为2,因为每个客户在categrory中只有4个唯一值,因此5不演示代码的功能。将其更改为要保留的所需值(5(。


使用的数据帧:

df = pd.DataFrame({
'customer_id': [411, 411, 411, 411, 411, 411, 527, 527, 527, 527, 527, 527],
'email': ['NORMAN.CURRIER@sakilacustomer.org',
'NORMAN.CURRIER@sakilacustomer.org',
'NORMAN.CURRIER@sakilacustomer.org',
'NORMAN.CURRIER@sakilacustomer.org',
'NORMAN.CURRIER@sakilacustomer.org',
'NORMAN.CURRIER@sakilacustomer.org',
'CORY.MEEHAN@sakilacustomer.org',
'CORY.MEEHAN@sakilacustomer.org',
'CORY.MEEHAN@sakilacustomer.org',
'CORY.MEEHAN@sakilacustomer.org',
'CORY.MEEHAN@sakilacustomer.org',
'CORY.MEEHAN@sakilacustomer.org'],
'address_id': [416, 416, 416, 416, 416, 416, 533, 533, 533, 533, 533, 533],
'name': ['NORMAN CURRIER', 'NORMAN CURRIER', 'NORMAN CURRIER',
'NORMAN CURRIER', 'NORMAN CURRIER', 'NORMAN CURRIER',
'CORY MEEHAN', 'CORY MEEHAN', 'CORY MEEHAN', 'CORY MEEHAN',
'CORY MEEHAN', 'CORY MEEHAN'],
'category': ['Scifi', 'Action', 'Sports', 'Scifi', 'Family', 'Action',
'Documentary', 'Action', 'Sports', 'Scifi', 'Documentary',
'Sports']
})

df:

customer_id                              email  address_id            name     category
0           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER        Scifi
1           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER       Action
2           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER       Sports
3           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER        Scifi
4           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER       Family
5           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER       Action
6           527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN  Documentary
7           527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN       Action
8           527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN       Sports
9           527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN        Scifi
10          527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN  Documentary
11          527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN       Sports

最新更新