Getting the corresponding value within a groupby



I have a dataset similar to this one:

Serial    A        B       
1         12               
1         31
1
1         12
1         31       203
1         10    
1         2
2         32       100     
2         32       242
2         3
3         2                
3         23       100     
3
3         23
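
For anyone who wants to try the snippets in this thread, the table above can be rebuilt roughly as follows. This is only a sketch: it assumes the blank cells are missing values (NaN), which matches the printed output further down, and the name df is simply the one used in the rest of the thread.

```python
import numpy as np
import pandas as pd

# Blank cells in the table are assumed to be missing values (NaN).
df = pd.DataFrame({
    "Serial": [1] * 7 + [2] * 3 + [3] * 4,
    "A": [12, 31, np.nan, 12, 31, 10, 2,
          32, 32, 3,
          2, 23, np.nan, 23],
    "B": [np.nan, np.nan, np.nan, np.nan, 203, np.nan, np.nan,
          100, 242, np.nan,
          np.nan, 100, np.nan, np.nan],
})

print(df)
```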

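I group the DataFrame by Serial, find the maximum of column A in each group with df['A_MAX'] = df.groupby('Serial')['A'].transform('max').values, and keep only the first value per group with df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated(), ''). The desired result is: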

Serial    A        B       A_MAX    B_corresponding
1         12               31       203
1         31
1
1         12
1         31       203
1         10    
1         2
2         32       100     32       100
2         32       242
2         3
3         2                23       100
3         23       100     
3
3         23

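Now, for the B_corresponding column, I want the B value that corresponds to A_MAX. I thought of locating the A_MAX value in A, but a group can contain its maximum A value more than once. As an additional condition, in Serial 2 for example, I want the smallest B among the rows where A is 32.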

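The idea is to sort with DataFrame.sort_values so that the largest A in each group comes first, remove the rows with a missing B with DataFrame.dropna, and keep the first remaining row per Serial with DataFrame.drop_duplicates. Then create a Series with DataFrame.set_index and finally use Series.map: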

df['A_MAX'] = df.groupby('Serial')['A'].transform('max')
df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated())
# For each Serial, take B from the row with the largest A that has a non-missing B
s = (df.sort_values(['Serial','A'], ascending=[True, False])
       .dropna(subset=['B'])
       .drop_duplicates('Serial')
       .set_index('Serial')['B'])
df['B_corresponding'] = df['Serial'].map(s).mask(df['Serial'].duplicated())
print(df)
    Serial     A      B  A_MAX  B_corresponding
0        1  12.0    NaN   31.0            203.0
1        1  31.0    NaN    NaN              NaN
2        1   NaN    NaN    NaN              NaN
3        1  12.0    NaN    NaN              NaN
4        1  31.0  203.0    NaN              NaN
5        1  10.0    NaN    NaN              NaN
6        1   2.0    NaN    NaN              NaN
7        2  32.0  100.0   32.0            100.0
8        2  32.0  242.0    NaN              NaN
9        2   3.0    NaN    NaN              NaN
10       3   2.0    NaN   23.0            100.0
11       3  23.0  100.0    NaN              NaN
12       3   NaN    NaN    NaN              NaN
13       3  23.0    NaN    NaN              NaN
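
The sort-based version does not sort on B, so which B is returned when the maximum A is tied depends on row order. A possible alternative, shown here only as a sketch and not as the answer's own method, filters the rows that hold their group's maximum A and then takes the minimum B per Serial. The sample frame is rebuilt under the assumption that blank cells are NaN; the names a_max, s and first are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question; blank cells are assumed to be NaN.
df = pd.DataFrame({
    "Serial": [1] * 7 + [2] * 3 + [3] * 4,
    "A": [12, 31, np.nan, 12, 31, 10, 2, 32, 32, 3, 2, 23, np.nan, 23],
    "B": [np.nan, np.nan, np.nan, np.nan, 203, np.nan, np.nan,
          100, 242, np.nan, np.nan, 100, np.nan, np.nan],
})

# Largest A per Serial, broadcast back to every row.
a_max = df.groupby("Serial")["A"].transform("max")

# Keep only the rows that hold their group's maximum A, then take the
# smallest B per Serial (min skips NaN by default).
s = df[df["A"].eq(a_max)].groupby("Serial")["B"].min()

# Write the results on the first row of each Serial only.
first = ~df["Serial"].duplicated()
df["A_MAX"] = a_max.where(first)
df["B_corresponding"] = df["Serial"].map(s).where(first)

print(df)
```

Unlike the sort-based version, this one returns NaN for a Serial whose maximum-A rows have no B at all, instead of falling back to the B of a smaller A. Which behaviour is wanted depends on the data.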

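Converting the missing values to empty strings is possible, but the columns then hold mixed values (numbers and strings), so later processing is likely to run into problems: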

df['A_MAX'] = df.groupby('Serial')['A'].transform('max')
df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated(), '')
# Same lookup as before; only the fill value for the masked rows changes
s = (df.sort_values(['Serial','A'], ascending=[True, False])
       .dropna(subset=['B'])
       .drop_duplicates('Serial')
       .set_index('Serial')['B'])
df['B_corresponding'] = df['Serial'].map(s).mask(df['Serial'].duplicated(), '')
print(df)
    Serial     A      B A_MAX B_corresponding
0        1  12.0    NaN    31             203
1        1  31.0    NaN                      
2        1   NaN    NaN                      
3        1  12.0    NaN                      
4        1  31.0  203.0                      
5        1  10.0    NaN                      
6        1   2.0    NaN                      
7        2  32.0  100.0    32             100
8        2  32.0  242.0                      
9        2   3.0    NaN                      
10       3   2.0    NaN    23             100
11       3  23.0  100.0                      
12       3   NaN    NaN                      
13       3  23.0    NaN                      
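
To make the mixed-value concern concrete, here is a small sketch on a stand-alone Series (the values are made up for illustration). It shows that filling with '' changes the dtype to object, that keeping NaN and blanking only at print time is one way around it, and that pd.to_numeric can undo the damage if the strings are already in the column.

```python
import numpy as np
import pandas as pd

# Illustrative values only: group maxima on first rows, NaN elsewhere.
a_max = pd.Series([31.0, np.nan, np.nan, 32.0])

# Filling with '' forces the column to object dtype (floats mixed with strings).
mixed = a_max.mask(a_max.isna(), "")
print(mixed.dtype)   # object

# Keeping NaN preserves float64; the blanks can be produced only when printing.
print(a_max.dtype)   # float64
print(a_max.to_frame("A_MAX").to_string(na_rep=""))

# If the empty strings are already there, to_numeric turns them back into NaN.
restored = pd.to_numeric(mixed, errors="coerce")
print(restored.dtype)  # float64
```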

If you are not set on using only pandas, you can also achieve the same thing with dictionaries.

import pandas as pd

# Smallest B for every (Serial, A) pair, so values from other groups cannot leak in
a_to_b_mapping = df.groupby(['Serial', 'A'])['B'].min().to_dict()
# Largest A for every Serial
serial_to_a_mapping = df.groupby('Serial')['A'].max().to_dict()
rows = []
for serial, a in serial_to_a_mapping.items():
    rows.append((serial, a, a_to_b_mapping.get((serial, a))))

agg_df = pd.DataFrame(rows, columns=['Serial', 'A_max', 'B_corresponding'])
agg_df.head()
   Serial  A_max  B_corresponding
0       1   31.0            203.0
1       2   32.0            100.0
2       3   23.0            100.0

If you like, you can join this back onto the original DataFrame and mask the repeated values.

dft = df.join(agg_df.set_index('Serial'), on='Serial', how='left')
# Blank out everything except the first row of each Serial
dft['A_max'] = dft['A_max'].mask(dft['Serial'].duplicated(), '')
dft['B_corresponding'] = dft['B_corresponding'].mask(dft['Serial'].duplicated(), '')
dft
