Pandas: How to group by (and sum) a category while retaining information from the subcategories



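This is a follow-up to the question "Pandas: How to subset (and sum) the top N observations within subcategories?", which demonstrated how to find the sum of the top 3 months per year in this dataframe: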

Sample dataframe

     year      month  passengers
0    1949    January         112
1    1949   February         118
2    1949      March         132
3    1949      April         129
4    1949        May         121
5    1949       June         135
.
.
.
137  1960       June         535
138  1960       July         622
139  1960     August         606
140  1960  September         508
141  1960    October         461
142  1960   November         390
143  1960   December         432

so that you end up with this:

    year  passengers
0   1949         432
1   1950         498
2   1951         582
3   1952         690
4   1953         779
5   1954         859
6   1955        1026
7   1956        1192
8   1957        1354
9   1958        1431
10  1959        1579
11  1960        1763

The number 432 for 1949 is the sum of 148+148+136 for the months July, August and September. My question now is:

Is it possible to perform the same calculation while retaining the corresponding subcategories as a list in a column of their own?

Desired output

(I've only checked the actual sum for 1949; 1950 is made up):

    year  passengers  months
0   1949         432  July, August, September 
1   1950         498  August, September, December
2   1951         582  .
3   1952         690  .
4   1953         779  .
5   1954         859  .
6   1955        1026  .
7   1956        1192  .
8   1957        1354  .
9   1958        1431  .
10  1959        1579  .
11  1960        1763  .

Reproducible code and data:

import pandas as pd
import seaborn as sns
df = sns.load_dataset('flights')
print(df.head())
df2 = df.groupby('year')['passengers'].apply(lambda x: x.nlargest(3).sum()).reset_index()
print(df2.head())

df:

   year     month  passengers
0  1949   January         112
1  1949  February         118
2  1949     March         132
3  1949     April         129
4  1949       May         121

df2:

   year  passengers
0  1949         432
1  1950         498
2  1951         582
3  1952         690
4  1953         779

Thanks for any suggestions!

Use a custom function with GroupBy.apply. The idea is to sort with sort_values first, then call head inside each group to get its top values:

def f(x):
    # rows arrive sorted by passengers (descending), so head(3) is the top 3
    x = x.head(3)
    names = ['passengers', 'months']
    return pd.Series([x['passengers'].sum(), ', '.join(x['month'])], index=names)

df2 = df.sort_values('passengers', ascending=False).groupby('year').apply(f).reset_index()
print(df2.head())
   year  passengers                   months
0  1949         432  July, August, September
1  1950         498  July, August, September
2  1951         582  July, August, September
3  1952         690       August, July, June
4  1953         779       August, July, June
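On recent pandas versions, GroupBy.apply may emit a warning when the applied function also receives the grouping column. One way to sidestep that is to select the needed columns before calling apply. The following is a minimal sketch of the same idea, assuming a small hand-built frame (only the 1949 and 1960 rows quoted in the question) in place of the seaborn dataset; the names flights, top3 and result are illustrative, and kind="mergesort" is added only so that ties keep their original order.

```python
import pandas as pd

# Subset of the flights data quoted in the question (not the full dataset)
flights = pd.DataFrame({
    "year": [1949] * 9 + [1960] * 7,
    "month": ["January", "February", "March", "April", "May", "June",
              "July", "August", "September",
              "June", "July", "August", "September", "October",
              "November", "December"],
    "passengers": [112, 118, 132, 129, 121, 135, 148, 148, 136,
                   535, 622, 606, 508, 461, 390, 432],
})

def top3(group):
    # group is already sorted descending, so the first 3 rows are the largest
    top = group.head(3)
    return pd.Series({
        "passengers": top["passengers"].sum(),
        "months": ", ".join(top["month"]),
    })

result = (
    flights.sort_values("passengers", ascending=False, kind="mergesort")
    .groupby("year")[["passengers", "month"]]  # pass only the needed columns
    .apply(top3)
    .reset_index()
)
print(result)
```

The per-year sums here are 432 for 1949 and 1763 for 1960, matching the figures elsewhere in this thread.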

Alternatively, group, then apply pd.DataFrame.nlargest directly instead of a custom function or lambda, then regroup on the index and apply suitable aggregations, e.g.:

new_df = (
    df.groupby('year').apply(pd.DataFrame.nlargest, 3, 'passengers')
    .groupby(level=0).agg({'passengers': 'sum', 'month': ', '.join})
    # optionally reset index
    # .reset_index()
)

This then gives you:

      passengers                    month
year                                     
1949         432  July, August, September
1950         498  July, August, September
1951         582  July, August, September
1952         690       August, July, June
1953         779       August, July, June
1954         859       July, August, June
...

It seems that year makes sense as the index of the resulting frame, but if not, apply .reset_index().
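A further variant avoids apply altogether: sort once, keep the first three rows per year with GroupBy.head, then aggregate with named aggregation. This is a minimal sketch rather than part of the answer above, assuming a small hand-built frame (only the 1949 and 1960 rows quoted in the question) instead of the seaborn dataset; the names flights, top3 and summary are illustrative.

```python
import pandas as pd

# Subset of the flights data quoted in the question (not the full dataset)
flights = pd.DataFrame({
    "year": [1949] * 9 + [1960] * 7,
    "month": ["January", "February", "March", "April", "May", "June",
              "July", "August", "September",
              "June", "July", "August", "September", "October",
              "November", "December"],
    "passengers": [112, 118, 132, 129, 121, 135, 148, 148, 136,
                   535, 622, 606, 508, 461, 390, 432],
})

# Step 1: sort descending, then keep the first 3 rows of every year
top3 = (
    flights.sort_values("passengers", ascending=False, kind="mergesort")
    .groupby("year")
    .head(3)
)

# Step 2: regroup and aggregate each column with its own function
summary = (
    top3.groupby("year")
    .agg(passengers=("passengers", "sum"), months=("month", ", ".join))
    .reset_index()
)
print(summary)
```

Named aggregation lets the output column be called months directly, so no rename is needed afterwards.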

You can do:

In [69]: df.groupby('year').apply(lambda x:
             x.nlargest(3, 'passengers').agg(
                 {'passengers': 'sum', 'month': lambda x: ', '.join(x.values)}
             )).reset_index()
Out[69]:
    year  passengers                    month
0   1949         432  July, August, September
1   1950         498  July, August, September
2   1951         582  July, August, September
3   1952         690       August, July, June
4   1953         779       August, July, June
5   1954         859       July, August, June
6   1955        1026       July, August, June
7   1956        1192       July, August, June
8   1957        1354       August, July, June
9   1958        1431       August, July, June
10  1959        1579       August, July, June
11  1960        1763       July, August, June

Here is a solution using nlargest.

def largest(x, k):
    # top-k rows of this group by passenger count
    vals = x.nlargest(n=k, columns=['passengers'])
    return [vals['passengers'].sum(), vals['month'].tolist()]

g = df.groupby('year').apply(largest, k=3).reset_index()
# split the [sum, months] pairs into two separate columns
joiner = pd.DataFrame(g[0].values.tolist(), columns=['passengers', 'months'])
res = g.drop(0, axis=1).join(joiner)
print(res)
   year  passengers               months
0  1949         382  [March, April, May]

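I deliberately kept months as a list; you can convert it to a comma-separated string if you prefer. Note that the output above appears to reflect only the five sample rows shown in the question (132 + 129 + 121 = 382); on the full dataset the 1949 row would be 432 with July, August and September.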
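Converting the list column to a comma-separated string can be done with Series.str.join. A minimal sketch, assuming a one-row stand-in for res (hand-built here so the snippet runs on its own):

```python
import pandas as pd

# Hypothetical stand-in for `res`: the months column holds Python lists
res = pd.DataFrame({
    "year": [1949],
    "passengers": [432],
    "months": [["July", "August", "September"]],
})

# Series.str.join concatenates the elements of each list with the separator
res["months"] = res["months"].str.join(", ")
print(res)
```

Keeping the list until the last step leaves the individual months available for further processing, and the join is a one-liner when a display string is needed.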
