I have this DataFrame:
Country  AgeRepartition  Count
USA      above 20        10
USA      less than 20    50
USA      above 50        40
Canada   above 20        50
Canada   less than 20    10
Canada   above 50        30
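I want to reshape this DataFrame so that each AgeRepartition category becomes its own column, with percentages as the values.
Expected output: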
Country  above 20  less than 20  above 50
USA      10%       50%           40%
Canada   55%       11%           33%
For example, the USA "above 20" percentage is the USA "above 20" count divided by the sum of all USA counts: 10/(10+50+40).
How can I do this?
You can use pivot + pipe in a single chain, dividing each row by its sum and multiplying by 100:
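df2 = (df
   # one row per country, one column per age group
   .pivot(index='Country', columns='AgeRepartition', values='Count')
   # divide each row by its own total, then scale to percent
   .pipe(lambda d: d.div(d.sum(axis=1), axis=0).mul(100))
)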
Output:
AgeRepartition  above 20  above 50  less than 20
Country
Canada             55.56     33.33         11.11
USA                10.00     40.00         50.00
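For anyone who wants to run the approach above end to end, here is a minimal, self-contained sketch. The DataFrame is rebuilt from the table in the question; the names `wide` and `pct` and the rounding to two decimals are my own choices for illustration, not part of the answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Country": ["USA", "USA", "USA", "Canada", "Canada", "Canada"],
    "AgeRepartition": ["above 20", "less than 20", "above 50",
                       "above 20", "less than 20", "above 50"],
    "Count": [10, 50, 40, 50, 10, 30],
})

# Wide table of counts: one row per country, one column per age group.
wide = df.pivot(index="Country", columns="AgeRepartition", values="Count")

# Divide each row by its own total (axis=0 aligns the totals on the row index),
# then scale to percent. Rounding is only for display.
pct = wide.div(wide.sum(axis=1), axis=0).mul(100).round(2)
print(pct)
```

The `axis=0` in `div` is what makes the division row-wise: the Series of row totals is matched against the index (countries) rather than the columns.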
Another approach is to use groupby + transform('sum') + rdiv to compute the percentages, assign to write them back into Count, and then pivot:
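out = (df.assign(Count=df.groupby('Country')['Count'].transform('sum')
                         .rdiv(df['Count']).mul(100)        # Count / country total * 100
                         .astype(int).astype(str).add('%'))  # truncate to int, append '%'
         # keyword arguments: recent pandas versions no longer accept positional ones here
         .pivot(index='Country', columns='AgeRepartition', values='Count')
         .reset_index().rename_axis(columns=[None]))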
Output:
  Country above 20 above 50 less than 20
0  Canada      55%      33%          11%
1     USA      10%      40%          50%
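To make the groupby + transform step easier to follow, here is a small sketch that stops before the pivot and shows the intermediate values. The column name `Percent` and the variable names are hypothetical, chosen only for this illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Country": ["USA", "USA", "USA", "Canada", "Canada", "Canada"],
    "AgeRepartition": ["above 20", "less than 20", "above 50",
                       "above 20", "less than 20", "above 50"],
    "Count": [10, 50, 40, 50, 10, 30],
})

# transform("sum") returns one value per original row (not one per group),
# so every row carries the total of its own country.
country_total = df.groupby("Country")["Count"].transform("sum")

# Share of each row within its country, in percent.
share = df["Count"].div(country_total).mul(100)

long_pct = df.assign(Percent=share.round(2))
print(long_pct)
```

Because `country_total` shares the index of `df`, the division lines up row by row without any merge. Note that the answer's `.astype(int)` truncates (55.56 becomes 55), whereas this sketch rounds to two decimals.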
Use DataFrame.pivot and divide the values by the row sums; to keep the rows and columns in their original order, use DataFrame.reindex:
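df = (df.pivot(index='Country', columns='AgeRepartition', values='Count')
        # unique() keeps first-appearance order, which restores the original ordering
        .reindex(columns=df['AgeRepartition'].unique(), index=df['Country'].unique()))
df = df.div(df.sum(axis=1), axis=0).mul(100)
print(df)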
AgeRepartition   above 20  less than 20   above 50
Country
USA             10.000000     50.000000  40.000000
Canada          55.555556     11.111111  33.333333
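A short sketch of why the reindex is needed: pivot returns its labels in sorted order, while `unique()` returns them in order of first appearance. The variable names `wide` and `ordered` are only for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Country": ["USA", "USA", "USA", "Canada", "Canada", "Canada"],
    "AgeRepartition": ["above 20", "less than 20", "above 50",
                       "above 20", "less than 20", "above 50"],
    "Count": [10, 50, 40, 50, 10, 30],
})

wide = df.pivot(index="Country", columns="AgeRepartition", values="Count")
print(list(wide.columns))   # sorted labels: 'above 20', 'above 50', 'less than 20'

# Reindex both axes to the order in which the labels first appear in df.
ordered = wide.reindex(index=df["Country"].unique(),
                       columns=df["AgeRepartition"].unique())
print(list(ordered.index))    # 'USA', 'Canada'
print(list(ordered.columns))  # 'above 20', 'less than 20', 'above 50'
```

Reindexing only reorders existing labels here, so no NaN values are introduced.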
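Another solution that gives the same order: convert both columns to ordered categoricals, so that the new index and the new columns follow the category order: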
df['Country'] = pd.Categorical(df['Country'],
                               ordered=True,
                               categories=df['Country'].unique())
df['AgeRepartition'] = pd.Categorical(df['AgeRepartition'],
                                      ordered=True,
                                      categories=df['AgeRepartition'].unique())
# keyword arguments: recent pandas versions no longer accept positional ones here
df = df.pivot(index='Country', columns='AgeRepartition', values='Count')
df = df.div(df.sum(axis=1), axis=0).mul(100)
print(df)
AgeRepartition   above 20  less than 20   above 50
Country
USA             10.000000     50.000000  40.000000
Canada          55.555556     11.111111  33.333333
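The simplest way to do the reshape is pivot_table from the pandas library (note that this returns the counts; divide each row by its sum to get percentages):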
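import pandas as pd
# pass the DataFrame as the first argument; the original call also lacked a comma before aggfunc
df = pd.pivot_table(df, index=['Country'], columns='AgeRepartition', values='Count', aggfunc='first')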
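Since pivot_table on its own only reshapes the counts, here is a hedged sketch of how the percentage step could be added on top of it. Using `aggfunc="sum"` instead of `'first'` is my assumption (it also covers duplicated Country/AgeRepartition pairs), and the names `counts` and `pct` are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Country": ["USA", "USA", "USA", "Canada", "Canada", "Canada"],
    "AgeRepartition": ["above 20", "less than 20", "above 50",
                       "above 20", "less than 20", "above 50"],
    "Count": [10, 50, 40, 50, 10, 30],
})

# Reshape to wide form; aggfunc="sum" is an assumption, see the note above.
counts = pd.pivot_table(df, index="Country", columns="AgeRepartition",
                        values="Count", aggfunc="sum")

# Row-wise percentages, as in the other answers.
pct = counts.div(counts.sum(axis=1), axis=0).mul(100).round(2)
print(pct)
```

Unlike pivot, pivot_table does not raise on duplicate index/column pairs; it aggregates them, so the choice of `aggfunc` matters if the data can contain repeats.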