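This is a follow-up question to "Pandas: How to subset (and sum) top N observations within subcategories?". That question demonstrated how to find the sum of the three highest monthly passenger counts per year in this dataframe:
Sample dataframe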
year month passengers
0 1949 January 112
1 1949 February 118
2 1949 March 132
3 1949 April 129
4 1949 May 121
5 1949 June 135
.
.
.
137 1960 June 535
138 1960 July 622
139 1960 August 606
140 1960 September 508
141 1960 October 461
142 1960 November 390
143 1960 December 432
So that you end up with this:
year passengers
0 1949 432
1 1950 498
2 1951 582
3 1952 690
4 1953 779
5 1954 859
6 1955 1026
7 1956 1192
8 1957 1354
9 1958 1431
10 1959 1579
11 1960 1763
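The number 432 for 1949 is the sum 148+148+136 for the months July, August and September.
My question now is:
Is it possible to perform the same calculation while also keeping the corresponding subcategories as a list in a column of their own?
Desired output
(I have only checked the actual sum for 1949; the months for 1950 are made up):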
year passengers months
0 1949 432 July, August, September
1 1950 498 August, September, December
2 1951 582 .
3 1952 690 .
4 1953 779 .
5 1954 859 .
6 1955 1026 .
7 1956 1192 .
8 1957 1354 .
9 1958 1431 .
10 1959 1579 .
11 1960 1763 .
Reproducible code and data:
import pandas as pd
import seaborn as sns
df = sns.load_dataset('flights')
print(df.head())
df2 = df.groupby('year')['passengers'].apply(lambda x: x.nlargest(3).sum()).reset_index()
print(df2.head())
DF:
year month passengers
0 1949 January 112
1 1949 February 118
2 1949 March 132
3 1949 April 129
4 1949 May 121
DF2:
year passengers
0 1949 432
1 1950 498
2 1951 582
3 1952 690
4 1953 779
Thank you for any suggestions!
Use a custom function with GroupBy.apply. The idea is to first sort with sort_values, then call head to get the top values per group:
def f(x):
    x = x.head(3)
    names = ['passengers','months']
    return pd.Series([x['passengers'].sum(), ', '.join(x['month'])], index=names)
df2 = df.sort_values('passengers', ascending=False).groupby('year').apply(f).reset_index()
print(df2.head())
year passengers months
0 1949 432 July, August, September
1 1950 498 July, August, September
2 1951 582 July, August, September
3 1952 690 August, July, June
4 1953 779 August, July, June
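The same sort-then-head idea can probably also be written without a custom function, by taking the top rows with GroupBy.head and then aggregating. The following is only a minimal sketch: the small frame is hand-typed for illustration (it is not the full flights dataset), and the names top3 and result are my own.

```python
import pandas as pd

# Small illustrative frame shaped like the flights data (not the full dataset).
df = pd.DataFrame({
    "year": [1949] * 5 + [1950] * 5,
    "month": ["January", "February", "March", "April", "May"] * 2,
    "passengers": [112, 118, 132, 129, 121, 115, 126, 141, 135, 125],
})

# Sort descending, then keep the first three rows of every year.
top3 = (
    df.sort_values("passengers", ascending=False)
      .groupby("year")
      .head(3)
)

# Sum the passengers and join the month names per year.
result = (
    top3.groupby("year")
        .agg(passengers=("passengers", "sum"), months=("month", ", ".join))
        .reset_index()
)
print(result)
```

Replacing `", ".join` with `list` in the aggregation should keep the months as a Python list instead of a string.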
Alternatively, group, then use pd.DataFrame.nlargest rather than applying a custom function or lambda, then regroup on the index and apply suitable aggregations, e.g.:
new_df = (
    df.groupby('year').apply(pd.DataFrame.nlargest, 3, 'passengers')
    .groupby(level=0).agg({'passengers': 'sum', 'month': ', '.join})
    # optionally reset index
    # .reset_index()
)
This then gives you:
passengers month
year
1949 432 July, August, September
1950 498 July, August, September
1951 582 July, August, September
1952 690 August, July, June
1953 779 August, July, June
1954 859 July, August, June
...
Having year as the index seems to make sense in the resulting frame, but if not, apply .reset_index().
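As far as I am aware, recent pandas releases (2.2 and later) emit a DeprecationWarning when GroupBy.apply operates on the grouping column itself. One way around that, sketched below under that assumption, is to select the non-grouping columns before calling apply. The small frame is hand-typed for illustration and top3 is a name I made up.

```python
import pandas as pd

# Small illustrative frame shaped like the flights data (not the full dataset).
df = pd.DataFrame({
    "year": [1949] * 5 + [1950] * 5,
    "month": ["January", "February", "March", "April", "May"] * 2,
    "passengers": [112, 118, 132, 129, 121, 115, 126, 141, 135, 125],
})

# Select only the non-grouping columns, so apply never touches 'year'.
top3 = (
    df.groupby("year")[["month", "passengers"]]
      .apply(lambda g: g.nlargest(3, "passengers"))
)

# The first index level is the year; regroup on it and aggregate.
new_df = top3.groupby(level=0).agg({"passengers": "sum", "month": ", ".join})
print(new_df)
```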
You could do:
In [69]: df.groupby('year').apply(lambda x:
             x.nlargest(3, 'passengers').agg(
                 {'passengers': 'sum', 'month': lambda x: ', '.join(x.values)}
             )).reset_index()
Out[69]:
year passengers month
0 1949 432 July, August, September
1 1950 498 July, August, September
2 1951 582 July, August, September
3 1952 690 August, July, June
4 1953 779 August, July, June
5 1954 859 July, August, June
6 1955 1026 July, August, June
7 1956 1192 July, August, June
8 1957 1354 August, July, June
9 1958 1431 August, July, June
10 1959 1579 August, July, June
11 1960 1763 July, August, June
Here is a solution using nlargest.
def largest(x, k):
    vals = x.nlargest(n=k, columns=['passengers'])
    return [vals['passengers'].sum(), vals['month'].tolist()]
g = df.groupby('year').apply(largest, k=3).reset_index()
joiner = pd.DataFrame(g[0].values.tolist(), columns=['passengers', 'months'])
res = g.drop(0, axis=1).join(joiner)
print(res)
year passengers months
0 1949 382 [March, April, May]
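I deliberately kept months as a list; if you prefer, you can convert it to a comma-separated string. Note that the output above appears to come from only the five sample rows shown in the question (132 + 129 + 121 = 382); on the full dataset the 1949 total is 432, as in the other answers.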
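If the comma-separated form is wanted, one option is Series.str.join, which joins list elements with a separator. A minimal sketch, where res is a hand-built one-row frame standing in for the result above:

```python
import pandas as pd

# Hand-built stand-in for the result frame, with months stored as a list.
res = pd.DataFrame({
    "year": [1949],
    "passengers": [382],
    "months": [["March", "April", "May"]],
})

# Join each list of month names into one comma-separated string.
res["months"] = res["months"].str.join(", ")
print(res)
```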