I have a pandas DataFrame like this:
import pandas as pd
df = pd.DataFrame({'group_id': [1, 1, 2, 2],
                   'name': ['Arthur', 'Bob', 'Caroline', 'Denise'],
                   'income': [40000, 20000, 50000, 60000]
                   })
df
Out[94]:
   group_id      name  income
0         1    Arthur   40000
1         1       Bob   20000
2         2  Caroline   50000
3         2    Denise   60000
The output I want has, for each group_id, the name of the person with the highest income, for example:
df
Out[94]:
   group_id      name  income highest_income_name
0         1    Arthur   40000              Arthur
1         1       Bob   20000              Arthur
2         2  Caroline   50000              Denise
3         2    Denise   60000              Denise
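Because of how my real data is generated, each group always has exactly one person with the highest income.
What is the best-practice way to produce the output above?
If I first fill in the maximum income and then look up the name, I get stuck with NaN values. I could try to fill them in afterwards, but that adds complexity.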
df['max_income'] = df.groupby('group_id')['income'].transform('max')
df['highest_income_name'] = df['name'][df['income']==df['max_income']]
df
Out[105]:
   group_id      name  income  max_income highest_income_name
0         1    Arthur   40000       40000              Arthur
1         1       Bob   20000       40000                 NaN
2         2  Caroline   50000       60000                 NaN
3         2    Denise   60000       60000              Denise
Use numpy.where together with GroupBy.transform:
In [287]: import numpy as np
In [302]: df['highest_income_name'] = np.where(df.income.eq(df.groupby('group_id')['income'].transform('max')), df.name, np.nan)
In [308]: df['highest_income_name'] = df.groupby('group_id')['highest_income_name'].transform('first')
In [309]: df
Out[309]:
   group_id      name  income highest_income_name
0         1    Arthur   40000              Arthur
1         1       Bob   20000              Arthur
2         2  Caroline   50000              Denise
3         2    Denise   60000              Denise
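If you would rather skip the intermediate NaN column, one possible alternative is to find the row label of each group's maximum with idxmax and map the matching name back by group_id. This is only a sketch under the question's assumption of a single top earner per group (on ties, idxmax returns the first occurrence); the variable names top_idx and top_name are my own and not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({'group_id': [1, 1, 2, 2],
                   'name': ['Arthur', 'Bob', 'Caroline', 'Denise'],
                   'income': [40000, 20000, 50000, 60000]})

# Row label of the highest income in each group (Series indexed by group_id).
top_idx = df.groupby('group_id')['income'].idxmax()

# Name of the top earner per group, keyed by group_id.
top_name = df.loc[top_idx].set_index('group_id')['name']

# Broadcast that name to every row of the group.
df['highest_income_name'] = df['group_id'].map(top_name)
```

This keeps everything keyed by group_id, so no NaN values need to be filled in afterwards.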