I have the following pandas DataFrame df:
cluster tag amount name
1 0 200 Michael
2 1 1200 John
2 1 900 Daniel
2 0 3000 David
2 0 600 Jonny
3 0 900 Denisse
3 1 900 Mike
3 1 3000 Kely
3 0 2000 Devon
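What I need is to add another column to df, filled in on every row, that holds the name (from the name column) with the highest amount among the rows of the same cluster where tag is 1. In other words, the result should look like this: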
cluster tag amount name highest_amount
1 0 200 Michael NaN
2 1 1200 John John
2 1 900 Daniel John
2 0 3000 David John
2 0 600 Jonny John
3 0 900 Denisse Kely
3 1 900 Mike Kely
3 1 3000 Kely Kely
3 0 2000 Devon Kely
I tried something like this:
df.groupby('cluster')[['name', 'amount']].transform('max')[df['tag'] == 1]
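But the problem is that the name is only filled in on rows where tag is 1 instead of being repeated on every row of the cluster. It looks like this: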
cluster tag amount name highest_amount
1 0 200 Michael NaN
2 1 1200 John John
2 1 900 Daniel John
2 0 3000 David NaN
2 0 600 Jonny NaN
3 0 900 Denisse NaN
3 1 900 Mike Kely
3 1 3000 Kely Kely
3 0 2000 Devon NaN
Can someone tell me how to add a condition to a split-apply-combine operation and have the result repeated on every row?
You can split this into two stages. First compute a mapping Series from cluster to name, then map it onto the cluster column:
# For each cluster, keep the name of the tag == 1 row with the largest amount
s = (df.query('tag == 1')
       .sort_values('amount', ascending=False)
       .drop_duplicates('cluster')
       .set_index('cluster')['name'])
df['highest_name'] = df['cluster'].map(s)
print(df)
cluster tag amount name highest_name
0 1 0 200 Michael NaN
1 2 1 1200 John John
2 2 1 900 Daniel John
3 2 0 3000 David John
4 2 0 600 Jonny John
5 3 0 900 Denisse Kely
6 3 1 900 Mike Kely
7 3 1 3000 Kely Kely
8 3 0 2000 Devon Kely
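For anyone who wants to reproduce the output above, here is a minimal self-contained sketch of the same two-stage approach. The DataFrame is rebuilt by hand from the table in the question, so the construction code is an assumption about how the data is loaded rather than something taken from the original post.

```python
import pandas as pd

# Sample data copied from the table in the question
df = pd.DataFrame({
    'cluster': [1, 2, 2, 2, 2, 3, 3, 3, 3],
    'tag':     [0, 1, 1, 0, 0, 0, 1, 1, 0],
    'amount':  [200, 1200, 900, 3000, 600, 900, 900, 3000, 2000],
    'name':    ['Michael', 'John', 'Daniel', 'David', 'Jonny',
                'Denisse', 'Mike', 'Kely', 'Devon'],
})

# Stage 1: build a cluster -> name mapping from the tag == 1 rows only.
# After sorting by amount (descending), the first row seen for each
# cluster is the one with the largest amount, so drop_duplicates keeps it.
s = (df.query('tag == 1')
       .sort_values('amount', ascending=False)
       .drop_duplicates('cluster')
       .set_index('cluster')['name'])

# Stage 2: look up every row's cluster in the mapping.
# Clusters with no tag == 1 row (cluster 1 here) are missing from s and get NaN.
df['highest_name'] = df['cluster'].map(s)
print(df)
```

The intermediate Series s has one entry per cluster that contains at least one tag == 1 row, which is why cluster 1 ends up as NaN after the map.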
If you want to use groupby, here is one way:
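import numpy as np

def func(x):
    # Name of the tag == 1 row with the largest amount in this group, or NaN if there is none
    names = x.query('tag == 1').sort_values('amount', ascending=False)['name']
    return names.iloc[0] if not names.empty else np.nan

df['highest_name'] = df['cluster'].map(df.groupby('cluster').apply(func))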
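As a further variation that is not part of the answer above, the mapping can also be built with groupby and idxmax, which avoids both the sort and the custom function. This is only a sketch under the assumption that df has a unique index and the columns shown in the question; the variable names tagged and best_rows are made up for illustration.

```python
import pandas as pd

# Sample data copied from the table in the question
df = pd.DataFrame({
    'cluster': [1, 2, 2, 2, 2, 3, 3, 3, 3],
    'tag':     [0, 1, 1, 0, 0, 0, 1, 1, 0],
    'amount':  [200, 1200, 900, 3000, 600, 900, 900, 3000, 2000],
    'name':    ['Michael', 'John', 'Daniel', 'David', 'Jonny',
                'Denisse', 'Mike', 'Kely', 'Devon'],
})

# Restrict to the rows that are allowed to "win" (tag == 1)
tagged = df[df['tag'] == 1]

# For each cluster, idxmax gives the index label of the row with the
# largest amount (the first such row if several rows tie)
best_rows = tagged.groupby('cluster')['amount'].idxmax()

# Turn those rows into a cluster -> name mapping
s = tagged.loc[best_rows].set_index('cluster')['name']

# Broadcast the winning name to every row of the cluster;
# clusters without any tag == 1 row get NaN
df['highest_name'] = df['cluster'].map(s)
print(df)
```

Like the first approach, this computes one value per cluster and then maps it back, so the result is repeated on every row regardless of that row's own tag.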