在Panda数据帧中对customer_id
列进行分组时,如何找到category
列的前5个值?
customer_id email address_id name category
0 411 NORMAN.CURRIER@sakilacustomer.org 416 NORMAN CURRIER Scifi
1 411 NORMAN.CURRIER@sakilacustomer.org 416 NORMAN CURRIER Action
2 411 NORMAN.CURRIER@sakilacustomer.org 416 NORMAN CURRIER Sports
3 411 NORMAN.CURRIER@sakilacustomer.org 416 NORMAN CURRIER Scifi
4 411 NORMAN.CURRIER@sakilacustomer.org 416 NORMAN CURRIER Family
5 411 NORMAN.CURRIER@sakilacustomer.org 416 NORMAN CURRIER Action
6 527 CORY.MEEHAN@sakilacustomer.org 533 CORY MEEHAN Documentary
7 527 CORY.MEEHAN@sakilacustomer.org 533 CORY MEEHAN Action
8 527 CORY.MEEHAN@sakilacustomer.org 533 CORY MEEHAN Sports
9 527 CORY.MEEHAN@sakilacustomer.org 533 CORY MEEHAN Scifi
10 527 CORY.MEEHAN@sakilacustomer.org 533 CORY MEEHAN Documentary
11 527 CORY.MEEHAN@sakilacustomer.org 533 CORY MEEHAN Sports
对于每个唯一的customer_id
,我想要另一个名为preferred_film_category
的列(前5个值基于每个唯一的customer_id出现特定类别的次数(
预期数据帧:
customer_id email address_id name category preferred_film_category
0 411 NORMAN.CURRIER@sakilacustomer.org 416 NORMAN CURRIER Scifi Scifi, Action, Sports, Animation, Drama
1 411 NORMAN.CURRIER@sakilacustomer.org 416 NORMAN CURRIER Action Scifi, Action, Sports, Animation, Drama
2 411 NORMAN.CURRIER@sakilacustomer.org 416 NORMAN CURRIER Sports Scifi, Action, Sports, Animation, Drama
3 411 NORMAN.CURRIER@sakilacustomer.org 416 NORMAN CURRIER Scifi Scifi, Action, Sports, Animation, Drama
4 411 NORMAN.CURRIER@sakilacustomer.org 416 NORMAN CURRIER Family Scifi, Action, Sports, Animation, Drama
5 411 NORMAN.CURRIER@sakilacustomer.org 416 NORMAN CURRIER Action Scifi, Action, Sports, Animation, Drama
6 527 CORY.MEEHAN@sakilacustomer.org 533 CORY MEEHAN Documentary Documentary, Sports, Scifi, Action
7 527 CORY.MEEHAN@sakilacustomer.org 533 CORY MEEHAN Action Documentary, Sports, Scifi, Action
8 527 CORY.MEEHAN@sakilacustomer.org 533 CORY MEEHAN Sports Documentary, Sports, Scifi, Action
9 527 CORY.MEEHAN@sakilacustomer.org 533 CORY MEEHAN Scifi Documentary, Sports, Scifi, Action
10 527 CORY.MEEHAN@sakilacustomer.org 533 CORY MEEHAN Documentary Documentary, Sports, Scifi, Action
11 527 CORY.MEEHAN@sakilacustomer.org 533 CORY MEEHAN Sports Documentary, Sports, Scifi, Action
尝试value_counts
+groupby nlargest
以获得最高频率的类别,然后groupby aggregate
转换为字符串,再join
与原始DataFrame:合并
n = 2
df = df.join(
df.value_counts(['customer_id', 'category'])
.groupby(level=0).nlargest(n)
.reset_index('category')
.groupby(level=0)['category'].agg(', '.join)
.rename('preferred_film_category'),
on='customer_id'
)
df
:
customer_id email address_id name category preferred_film_category
0 411 NORMAN.CURRIER@sakilacustomer.org 416 NORMAN CURRIER Scifi Action, Scifi
1 411 NORMAN.CURRIER@sakilacustomer.org 416 NORMAN CURRIER Action Action, Scifi
2 411 NORMAN.CURRIER@sakilacustomer.org 416 NORMAN CURRIER Sports Action, Scifi
3 411 NORMAN.CURRIER@sakilacustomer.org 416 NORMAN CURRIER Scifi Action, Scifi
4 411 NORMAN.CURRIER@sakilacustomer.org 416 NORMAN CURRIER Family Action, Scifi
5 411 NORMAN.CURRIER@sakilacustomer.org 416 NORMAN CURRIER Action Action, Scifi
6 527 CORY.MEEHAN@sakilacustomer.org 533 CORY MEEHAN Documentary Documentary, Sports
7 527 CORY.MEEHAN@sakilacustomer.org 533 CORY MEEHAN Action Documentary, Sports
8 527 CORY.MEEHAN@sakilacustomer.org 533 CORY MEEHAN Sports Documentary, Sports
9 527 CORY.MEEHAN@sakilacustomer.org 533 CORY MEEHAN Scifi Documentary, Sports
10 527 CORY.MEEHAN@sakilacustomer.org 533 CORY MEEHAN Documentary Documentary, Sports
11 527 CORY.MEEHAN@sakilacustomer.org 533 CORY MEEHAN Sports Documentary, Sports
*注意n
被设置为2
,因为每个客户在categrory
中只有4个唯一值,因此5
不演示代码的功能。将其更改为要保留的所需值(5
(。
使用的数据帧:
df = pd.DataFrame({
'customer_id': [411, 411, 411, 411, 411, 411, 527, 527, 527, 527, 527, 527],
'email': ['NORMAN.CURRIER@sakilacustomer.org',
'NORMAN.CURRIER@sakilacustomer.org',
'NORMAN.CURRIER@sakilacustomer.org',
'NORMAN.CURRIER@sakilacustomer.org',
'NORMAN.CURRIER@sakilacustomer.org',
'NORMAN.CURRIER@sakilacustomer.org',
'CORY.MEEHAN@sakilacustomer.org',
'CORY.MEEHAN@sakilacustomer.org',
'CORY.MEEHAN@sakilacustomer.org',
'CORY.MEEHAN@sakilacustomer.org',
'CORY.MEEHAN@sakilacustomer.org',
'CORY.MEEHAN@sakilacustomer.org'],
'address_id': [416, 416, 416, 416, 416, 416, 533, 533, 533, 533, 533, 533],
'name': ['NORMAN CURRIER', 'NORMAN CURRIER', 'NORMAN CURRIER',
'NORMAN CURRIER', 'NORMAN CURRIER', 'NORMAN CURRIER',
'CORY MEEHAN', 'CORY MEEHAN', 'CORY MEEHAN', 'CORY MEEHAN',
'CORY MEEHAN', 'CORY MEEHAN'],
'category': ['Scifi', 'Action', 'Sports', 'Scifi', 'Family', 'Action',
'Documentary', 'Action', 'Sports', 'Scifi', 'Documentary',
'Sports']
})
df
:
customer_id email address_id name category
0 411 NORMAN.CURRIER@sakilacustomer.org 416 NORMAN CURRIER Scifi
1 411 NORMAN.CURRIER@sakilacustomer.org 416 NORMAN CURRIER Action
2 411 NORMAN.CURRIER@sakilacustomer.org 416 NORMAN CURRIER Sports
3 411 NORMAN.CURRIER@sakilacustomer.org 416 NORMAN CURRIER Scifi
4 411 NORMAN.CURRIER@sakilacustomer.org 416 NORMAN CURRIER Family
5 411 NORMAN.CURRIER@sakilacustomer.org 416 NORMAN CURRIER Action
6 527 CORY.MEEHAN@sakilacustomer.org 533 CORY MEEHAN Documentary
7 527 CORY.MEEHAN@sakilacustomer.org 533 CORY MEEHAN Action
8 527 CORY.MEEHAN@sakilacustomer.org 533 CORY MEEHAN Sports
9 527 CORY.MEEHAN@sakilacustomer.org 533 CORY MEEHAN Scifi
10 527 CORY.MEEHAN@sakilacustomer.org 533 CORY MEEHAN Documentary
11 527 CORY.MEEHAN@sakilacustomer.org 533 CORY MEEHAN Sports