Getting descriptive statistics with (analytic) weights using describe() in Python

I am trying to convert code from Stata to Python.

The original code in Stata:

by year, sort : summarize age [aweight = wt]

Usually a simple describe() call would do the job:

dataframe.groupby("year")["age"].describe()

But I cannot find a way to translate the aweight option into Python, i.e. to get descriptive statistics of the dataset under analytic/variance weighting.

Code to generate the dataset in Python:

dataframe = {'year': [2016, 2016, 2020, 2020], 'age': [41, 65, 35, 28], 'wt': [1.2, 0.7, 0.8, 1.5]}

If I run by year, sort : summarize age [aweight = wt] in Stata,

the result is: mean = 49.842 and SD = 16.37.

What should I do to get the same result in Python?
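For reference, the Stata numbers above can be reproduced for a single group with plain numpy. This is a sketch of how analytic weights behave in summarize, under the assumption that Stata rescales the weights to sum to the group size n and applies an n-1 correction to the variance:

```python
import numpy as np

# 2016 group from the dataset above
age = np.array([41.0, 65.0])
wt = np.array([1.2, 0.7])
n = len(age)
w = wt * n / wt.sum()                               # rescaled weights sum to n
mean = np.average(age, weights=w)                   # weighted mean
std = np.sqrt(np.sum(w * (age - mean) ** 2) / (n - 1))  # unbiased weighted SD
```

This yields mean ≈ 49.842 and SD ≈ 16.372, matching the Stata output quoted above.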

So I wrote a function that does the same thing as describe, except that it takes a weights argument. I tested it on the small dataframe you provided, but not exhaustively. I tried to avoid .apply in case you have a large dataframe, although I did not run benchmarks to check whether my approach is actually faster or less memory-intensive than writing a function that produces the weighted describe for a single group and then applying it to each by-group in the dataframe. That would probably be the simplest option.
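For what it's worth, that apply-based alternative (which I did not benchmark) might look like the sketch below; wdescribe is a hypothetical helper that rescales the analytic weights to sum to each group's size n before applying the n-1 variance correction:

```python
import numpy as np
import pandas as pd

def wdescribe(g, col='age', wt='wt'):
    # hypothetical per-group helper: aweight-style mean and unbiased std
    x = g[col].to_numpy(float)
    w = g[wt].to_numpy(float)
    n = len(x)
    w = w * n / w.sum()                      # rescale weights to sum to n
    m = np.average(x, weights=w)
    s = np.sqrt(np.sum(w * (x - m) ** 2) / (n - 1))
    return pd.Series({'count': n, 'mean': m, 'std': s})

df = pd.DataFrame({'year': [2016, 2016, 2020, 2020],
                   'age': [41, 65, 35, 28],
                   'wt': [1.2, 0.7, 0.8, 1.5]})
print(df.groupby('year').apply(wdescribe))
```

This calls wdescribe once per by-group, which is simple but can be slow on a dataframe with many groups.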

Count, min, and max can be taken without regard to the weights. Then I compute the simple weighted mean and a standard deviation derived from the unbiased variance formula. I included a frequency-weighting option, which only affects the sample size used to adjust the variance into an unbiased estimator: with frequency weights, the sum of the weights is used as the sample size; otherwise, the count of the data is used. I used this answer to help get the weighted percentiles.
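The weighted-percentile idea can be illustrated on one group in isolation: with weights normalized to sum to 1, each sorted value sits at the midpoint of its weight interval on the cumulative scale, and np.interp reads the quantiles off that grid. A sketch, using the 2016 group:

```python
import numpy as np

# 2016 group, already sorted by age
age = np.array([41.0, 65.0])
w = np.array([1.2, 0.7])
w = w / w.sum()                     # normalize weights to sum to 1
pos = np.cumsum(w) - 0.5 * w        # midpoint position of each value
q = np.interp([0.25, 0.5, 0.75], pos, age)
```

Here q comes out as roughly [41, 49.84, 61.84], matching the 25%/50%/75% columns in the output further down.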

import pandas as pd
import numpy as np
df = pd.DataFrame({'year': [2016, 2016, 2020, 2020],
                   'age': [41, 65, 35, 28],
                   'wt': [1.2, 0.7, 0.8, 1.5]})
df
year  age   wt
0  2016   41  1.2
1  2016   65  0.7
2  2020   35  0.8
3  2020   28  1.5

Then I define the function below.

def weighted_groupby_describe(df, col, by, wt, frequency=False):
    '''
    df : dataframe
    col : column for which you want statistics, must be single column
    by : groupby column(s)
    wt : column to use for weights
    frequency : if True, use sample size as sum of weights (only affects
        degrees of freedom correction for unbiased variance)
    '''

    if isinstance(by, list):
        df = df.sort_values(by + [col])
    else:
        df = df.sort_values([by] + [col])

    newcols = ['gb_weights', 'col_weighted', 'col_mean',
               'col_sqdiff', 'col_sqdiff_weighted', 'gb_weights_cumsum', 'ngroup']
    assert all([c not in df.columns for c in newcols])

    # normalize weights to sum to 1 within each group
    df['gb_weights'] = df[wt]/df.groupby(by)[wt].transform('sum')

    df['gb_weights_cumsum'] = df.groupby(by)['gb_weights'].cumsum()

    df['col_weighted'] = df.eval('{}*gb_weights'.format(col))

    df['col_mean'] = df.groupby(by)['col_weighted'].transform('sum')

    df['col_sqdiff'] = df.eval('({}-col_mean)**2'.format(col))
    df['col_sqdiff_weighted'] = df.eval('col_sqdiff*gb_weights')

    wstd = df.groupby(by)['col_sqdiff_weighted'].sum()**(0.5)
    wstd.name = 'std'

    wmean = df.groupby(by)['col_weighted'].sum()
    wmean.name = 'mean'

    df['ngroup'] = df.groupby(by).ngroup()
    quantiles = np.array([0.25, 0.5, 0.75])
    # offset each group's cumulative-weight grid by its group number so a
    # single np.interp call covers all groups at once
    weighted_quantiles = df['gb_weights_cumsum'] - 0.5*df['gb_weights'] + df['ngroup']
    ngroups = df['ngroup'].max() + 1
    x = np.hstack([quantiles + i for i in range(ngroups)])
    quantvals = np.interp(x, weighted_quantiles, df[col])
    quantvals = np.reshape(quantvals, (ngroups, -1))

    other = df.groupby(by)[col].agg(['min', 'max', 'count'])

    stats = pd.concat([wmean, wstd, other], axis=1, sort=False)

    stats['25%'] = quantvals[:, 0]
    stats['50%'] = quantvals[:, 1]
    stats['75%'] = quantvals[:, 2]

    colorder = ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
    stats = stats[colorder]

    if frequency:
        sizes = df.groupby(by)[wt].sum()
    else:
        sizes = stats['count']

    stats['weight'] = sizes

    # use the "sample size" (weight) to obtain std. deviation from unbiased
    # variance
    stats['std'] = stats.eval('((std**2)*(weight/(weight-1)))**(1/2)')

    return stats

And test it:

weighted_groupby_describe(df, 'age', 'year', 'wt')
count       mean        std  min  ...        50%        75%  max  weight
year                                    ...                                   
2016      2  49.842105  16.372398   41  ...  49.842105  61.842105   65       2
2020      2  30.434783   4.714936   28  ...  30.434783  33.934783   35       2

Compare this with the output without weights:

df.groupby('year')['age'].describe()
count  mean        std   min    25%   50%    75%   max
year                                                        
2016    2.0  53.0  16.970563  41.0  47.00  53.0  59.00  65.0
2020    2.0  31.5   4.949747  28.0  29.75  31.5  33.25  35.0
