I am trying to convert some code from Stata to Python.
The original code in Stata:
by year, sort : summarize age [aweight = wt]
Usually a simple describe() call would do:
dataframe.groupby("year")["age"].describe()
But I cannot find a way to translate the aweight option into Python, i.e. to get descriptive statistics of the dataset under analytic/variance weighting.
Code to generate the dataset in Python:
import pandas as pd
dataframe = pd.DataFrame({'year': [2016, 2016, 2020, 2020], 'age': [41, 65, 35, 28], 'wt': [1.2, 0.7, 0.8, 1.5]})
If I run by year, sort : summarize age [aweight = wt] in Stata, the result (for the 2016 group) is: mean = 49.842 and SD = 16.37.
What should I do to get the same result in Python?
So I wrote a function that does the same thing as describe, except that it takes a weight argument. I tested it on the small dataframe you provided, but not much beyond that. In case you have a large dataframe, I tried to avoid .apply, although I haven't run benchmarks to check whether my approach is actually faster or less memory-intensive than writing a function that does a weighted describe for a single by group and then using apply to run it over every by group in the dataframe. That would probably be the simplest.

The count, min, and max can be taken without regard to the weights. I then compute a straightforward weighted mean and standard deviation, with a correction based on the unbiased variance formula. I included a frequency-weight option, which should only affect the sample size used to adjust the variance into an unbiased estimator. With frequency weights, the sum of the weights is used as the sample size; otherwise, the observation count in the data is used. I used this answer to help get the weighted percentiles.
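For a single group, the weighted mean, the bias-corrected standard deviation, and the interpolated weighted median described above can be sketched with plain numpy. This is a minimal check against the Stata numbers, using the 2016 group from the question's data:

```python
import numpy as np

# 2016 group from the question's data (ages and analytic weights)
x = np.array([41.0, 65.0])
w = np.array([1.2, 0.7])

# Weighted mean and (biased) weighted variance
wmean = np.average(x, weights=w)
biased_var = np.average((x - wmean) ** 2, weights=w)

# Unbiased correction using the observation count (the frequency=False case)
n = len(x)
wstd = np.sqrt(biased_var * n / (n - 1))
print(round(wmean, 3), round(wstd, 3))  # 49.842 16.372

# Weighted median via the same interpolation scheme used below:
# positions are cumsum(w_norm) - 0.5*w_norm, then interpolate at 0.5
wn = w / w.sum()
pos = np.cumsum(wn) - 0.5 * wn
wmedian = np.interp(0.5, pos, x)
print(round(wmedian, 3))  # 49.842
```

This reproduces the Stata aweight mean of 49.842 and SD of 16.37 quoted in the question.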
import pandas as pd
import numpy as np
df = pd.DataFrame({'year': [2016, 2016, 2020, 2020],
                   'age': [41, 65, 35, 28], 'wt': [1.2, 0.7, 0.8, 1.5]})
df
year age wt
0 2016 41 1.2
1 2016 65 0.7
2 2020 35 0.8
3 2020 28 1.5
Then I define the function below.
def weighted_groupby_describe(df, col, by, wt, frequency=False):
'''
df : dataframe
col : column for which you want statistics, must be single column
by : groupby column(s)
wt : column to use for weights
    frequency : if True, use the sum of weights as the sample size (only affects
        the degrees of freedom correction for the unbiased variance)
'''
if isinstance(by, list):
df = df.sort_values(by+[col])
else:
df = df.sort_values([by] + [col])
newcols = ['gb_weights', 'col_weighted', 'col_mean',
'col_sqdiff', 'col_sqdiff_weighted', 'gb_weights_cumsum', 'ngroup']
assert all([c not in df.columns for c in newcols])
df['gb_weights'] = df[wt]/df.groupby(by)[wt].transform('sum')
df['gb_weights_cumsum'] = df.groupby(by)['gb_weights'].cumsum()
df['col_weighted'] = df.eval('{}*gb_weights'.format(col))
df['col_mean'] = df.groupby(by)['col_weighted'].transform('sum')
df['col_sqdiff'] = df.eval('({}-col_mean)**2'.format(col))
df['col_sqdiff_weighted'] = df.eval('col_sqdiff*gb_weights')
wstd = df.groupby(by)['col_sqdiff_weighted'].sum()**(0.5)
wstd.name = 'std'
wmean = df.groupby(by)['col_weighted'].sum()
wmean.name = 'mean'
df['ngroup'] = df.groupby(by).ngroup()
quantiles = np.array([0.25, 0.5, 0.75])
weighted_quantiles = df['gb_weights_cumsum'] - 0.5*df['gb_weights'] + df['ngroup']
ngroups = df['ngroup'].max()+1
x = np.hstack([quantiles+i for i in range(ngroups)])
quantvals = np.interp(x, weighted_quantiles, df[col])
quantvals = np.reshape(quantvals, (ngroups, -1))
other = df.groupby(by)[col].agg(['min', 'max', 'count'])
stats = pd.concat([wmean, wstd, other], axis=1, sort=False)
stats['25%'] = quantvals[:, 0]
stats['50%'] = quantvals[:, 1]
stats['75%'] = quantvals[:, 2]
colorder = ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
stats = stats[colorder]
if frequency:
sizes = df.groupby(by)[wt].sum()
else:
sizes = stats['count']
stats['weight'] = sizes
# use the "sample size" (weight) to obtain std. deviation from unbiased
# variance
stats['std'] = stats.eval('((std**2)*(weight/(weight-1)))**(1/2)')
return stats
And test it.
weighted_groupby_describe(df, 'age', 'year', 'wt')
count mean std min ... 50% 75% max weight
year ...
2016 2 49.842105 16.372398 41 ... 49.842105 61.842105 65 2
2020 2 30.434783 4.714936 28 ... 30.434783 33.934783 35 2
Compare this to the output without weights.
df.groupby('year')['age'].describe()
count mean std min 25% 50% 75% max
year
2016 2.0 53.0 16.970563 41.0 47.00 53.0 59.00 65.0
2020 2.0 31.5 4.949747 28.0 29.75 31.5 33.25 35.0
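For completeness, the apply-based alternative mentioned at the start of the answer would look roughly like this. It is only a sketch: wdescribe is a hypothetical helper name, and it computes just a subset of the statistics (no percentiles, no frequency-weight option):

```python
import numpy as np
import pandas as pd

def wdescribe(g, col='age', wt='wt'):
    # Hypothetical per-group helper: weighted mean/std plus unweighted count/min/max
    x = g[col].to_numpy(dtype=float)
    w = g[wt].to_numpy(dtype=float)
    m = np.average(x, weights=w)
    # Biased weighted variance, corrected using the observation count
    v = np.average((x - m) ** 2, weights=w) * len(x) / (len(x) - 1)
    return pd.Series({'count': len(x), 'mean': m, 'std': np.sqrt(v),
                      'min': x.min(), 'max': x.max()})

df = pd.DataFrame({'year': [2016, 2016, 2020, 2020],
                   'age': [41, 65, 35, 28], 'wt': [1.2, 0.7, 0.8, 1.5]})
stats = df.groupby('year')[['age', 'wt']].apply(wdescribe)
print(stats)
```

On this toy dataframe it gives the same means and standard deviations as the function above; whether it is slower on a large dataframe would need benchmarking.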