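I have the dataframe shown below. It is sorted so that "POP" is in descending order within each "STATE". Now I want to sum the three largest values of "POP" for each "STATE". How should I do that?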
import pandas as pd
d = [['X','q',123383],['X','w',43857349],['X','e',236657],['X','r',23574594],
['Y','t',547853],['Y','y',46282134],['Y','u',43857439],['Y','i',32654893],['Y','i',95678312]]
df = pd.DataFrame(d, columns = ['STATE','COUNTY','POP'])
df.sort_values(['STATE','POP'], ascending=[True, False]).set_index(['STATE','COUNTY'])
print(sorted_df)
# sorted_df:
                   POP
STATE COUNTY
X     w       43857349
      r       23574594
      e         236657
      q         123383
Y     i       95678312
      y       46282134
      u       43857439
      i       32654893
      t         547853
There is nlargest, so the sort beforehand is not needed:
df.groupby(['STATE']).POP.nlargest(3)
which gives you
STATE
X      1    43857349
       3    23574594
       2      236657
Y      8    95678312
       5    46282134
       6    43857439
Name: POP, dtype: int64
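And if you only care about the sum:
df.groupby(['STATE']).POP.nlargest(3).groupby(level=0).sum()  # older pandas also accepted .sum(level=0), which was removed in pandas 2.0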
which gives:
STATE
X     67668600
Y    185817885
Name: POP, dtype: int64
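For reference, here is a self-contained sketch of the nlargest approach, using the data from the question. The helper name top_n_sum and its parameter n are made up for illustration and are not part of pandas. The sketch collapses the intermediate (STATE, row label) index with groupby(level=0).sum(), which should also work on pandas versions where sum(level=0) is no longer available.

```python
import pandas as pd

d = [['X', 'q', 123383], ['X', 'w', 43857349], ['X', 'e', 236657], ['X', 'r', 23574594],
     ['Y', 't', 547853], ['Y', 'y', 46282134], ['Y', 'u', 43857439], ['Y', 'i', 32654893],
     ['Y', 'i', 95678312]]
df = pd.DataFrame(d, columns=['STATE', 'COUNTY', 'POP'])


def top_n_sum(frame, n=3):
    """Sum the n largest POP values within each STATE (hypothetical helper)."""
    # nlargest works per group, so the frame does not need to be sorted first.
    top = frame.groupby('STATE')['POP'].nlargest(n)
    # top is indexed by (STATE, original row label); collapse it back to STATE.
    return top.groupby(level=0).sum()


result = top_n_sum(df)
print(result)
```

Passing a different n, for example top_n_sum(df, n=2), sums only the two largest counties per state.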
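Make sure to reassign the dataframe after sorting it (perhaps you meant to call the result sorted_df).
Group on the STATE level (or level=0, given that the index is a MultiIndex of STATE and COUNTY), then apply a lambda that takes the top three rows of each state with head and sums them.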
top_n = 3
df = df.sort_values(['STATE','POP'], ascending=[True, False]).set_index(['STATE','COUNTY'])
>>> df.groupby(level='STATE').apply(lambda x: x.head(top_n).sum())
             POP
STATE
X       67668600  # w: 43857349 + r: 23574594 + e: 236657
Y      185817885  # i: 95678312 + y: 46282134 + u: 43857439
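As a possible variation on the same idea, the lambda can be avoided by chaining two group operations: head keeps the first rows of each group, and a second groupby sums them. This is only a sketch that relies on the frame already being sorted by POP in descending order within each STATE. The names sorted_df, top_rows and sums are chosen here for illustration.

```python
import pandas as pd

d = [['X', 'q', 123383], ['X', 'w', 43857349], ['X', 'e', 236657], ['X', 'r', 23574594],
     ['Y', 't', 547853], ['Y', 'y', 46282134], ['Y', 'u', 43857439], ['Y', 'i', 32654893],
     ['Y', 'i', 95678312]]
df = pd.DataFrame(d, columns=['STATE', 'COUNTY', 'POP'])

top_n = 3

# Sort descending by POP within each STATE and keep the sorted result.
sorted_df = (df.sort_values(['STATE', 'POP'], ascending=[True, False])
               .set_index(['STATE', 'COUNTY']))

# head(top_n) keeps the first top_n rows of every STATE group,
# which are the largest ones because of the sort above.
top_rows = sorted_df.groupby(level='STATE').head(top_n)

# Sum the remaining rows per STATE.
sums = top_rows.groupby(level='STATE').sum()
print(sums)
```

Unlike the nlargest version, this one depends on the sort order, so it gives the wrong totals if the sort step is skipped.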