我有一个数据框my_df
,然后我想创建一个新的数据框new_df
。每个new_df
列都是由groupby
my_id
创建的,然后在my_df
中获取列的max
。
下面是我的代码,它工作正常。但是,我想知道有没有更好的方法?特别是将来我将处理数百列而不是仅 6 列?多谢!
tmp_df1 = my_df.groupby(['my_id'], as_index=False).col_A.agg({"max_A": "max"})
tmp_df2 = my_df.groupby(['my_id'], as_index=False).col_B.agg({"max_B": "max"})
tmp_df3 = my_df.groupby(['my_id'], as_index=False).col_C.agg({"max_C": "max"})
tmp_df4 = my_df.groupby(['my_id'], as_index=False).col_D.agg({"max_D": "max"})
tmp_df5 = my_df.groupby(['my_id'], as_index=False).col_E.agg({"max_E": "max"})
tmp_df6 = my_df.groupby(['my_id'], as_index=False).col_F.agg({"max_F": "max"})
combine_df1 = pd.merge(tmp_df1,tmp_df2,how="inner",on=['my_id'])
combine_df2 = pd.merge(combine_df1,tmp_df3,how="inner",on=['my_id'])
combine_df3 = pd.merge(combine_df2,tmp_df4,how="inner",on=['my_id'])
combine_df4 = pd.merge(combine_df3,tmp_df5,how="inner",on=['my_id'])
new_df = pd.merge(combine_df4,tmp_df6,how="inner",on=['my_id'])
考虑示例数据帧my_df
np.random.seed([3,1415])
my_df = pd.DataFrame(dict(
my_id=list('WXYZ') * 4,
col_A=np.random.randint(10, size=16),
col_B=np.random.randint(10, size=16),
col_C=np.random.randint(10, size=16),
col_D=np.random.randint(10, size=16),
col_E=np.random.randint(10, size=16),
col_F=np.random.randint(10, size=16),
))
my_df
col_A col_B col_C col_D col_E col_F my_id
0 0 3 6 1 9 3 W
1 2 2 4 5 8 5 X
2 7 4 7 2 0 8 Y
3 3 3 6 8 4 5 Z
4 8 3 2 2 3 1 W
5 7 6 6 4 0 5 X
6 0 7 6 7 4 1 Y
7 6 7 5 6 1 4 Z
8 8 4 2 9 5 3 W
9 6 5 8 4 8 9 X
10 0 3 7 2 6 5 Y
11 2 7 5 4 0 5 Z
12 0 5 8 6 8 7 W
13 4 9 4 3 7 0 X
14 9 8 7 8 4 3 Y
15 7 7 6 3 6 2 Z
您只想使用groupby
和max
new_df = my_df.groupby('my_id', as_index=False).max()
new_df
my_id col_A col_B col_C col_D col_E col_F
0 W 8 5 8 9 9 7
1 X 7 9 8 5 8 9
2 Y 9 8 7 8 6 8
3 Z 7 7 6 8 6 5
如果要更改列的名称:
my_df.groupby('my_id').max()
.rename(columns=lambda x: x.replace('col_', ''))
.add_suffix('_max').reset_index()
my_id A_max B_max C_max D_max E_max F_max
0 W 8 5 8 9 9 7
1 X 7 9 8 5 8 9
2 Y 9 8 7 8 6 8
3 Z 7 7 6 8 6 5
或
my_df.groupby('my_id').max()
.rename(columns=lambda x: x.rsplit('_', 1)[1] + '_max').reset_index()
my_id A_max B_max C_max D_max E_max F_max
0 W 8 5 8 9 9 7
1 X 7 9 8 5 8 9
2 Y 9 8 7 8 6 8
3 Z 7 7 6 8 6 5