I have a DataFrame that looks like this:
Out[14]:
    impwealth  indweight
16     180000     34.200
21     384000     37.800
26     342000     39.715
30    1154000     44.375
31     421300     44.375
32    1210000     45.295
33    1062500     45.295
34    1878000     46.653
35     876000     46.653
36     925000     53.476
I want to compute the weighted median of the column impwealth using the frequency weights in indweight. My pseudocode is as follows:
# Sort by `impwealth` in ascending order
df.sort_values('impwealth', inplace=True)
# Find the 50th percentile weight, P
P = df['indweight'].sum() * (.5)
# Search for the first occurrence of `indweight` that is greater than P
i = df.loc[df['indweight'] > P, 'indweight'].last_valid_index()
# The value of `impwealth` associated with this index will be the weighted median
w_median = df.loc[i, 'impwealth']
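This approach seems clunky, and I'm not sure it is correct. I couldn't find a built-in way to do this in the pandas reference. What is the best way to find a weighted median?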
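If you want to do this in pure pandas, here is one way. It does not interpolate either. (@svenkatesh, you were missing the cumulative sum in your pseudocode.)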
df.sort_values('impwealth', inplace=True)
cumsum = df.indweight.cumsum()
cutoff = df.indweight.sum() / 2.0
median = df.impwealth[cumsum >= cutoff].iloc[0]
This gives a median of 925000.
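To make that result easy to check, here is a self-contained sketch that rebuilds the sample frame from the question and prints the running weights next to the sorted values. The names ordered and running are mine, not part of the answer above, and the code works on a sorted copy rather than sorting in place.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "impwealth": [180000, 384000, 342000, 1154000, 421300,
                      1210000, 1062500, 1878000, 876000, 925000],
        "indweight": [34.200, 37.800, 39.715, 44.375, 44.375,
                      45.295, 45.295, 46.653, 46.653, 53.476],
    },
    index=[16, 21, 26, 30, 31, 32, 33, 34, 35, 36],
)

# Sort by value, then accumulate the weights in that order.
ordered = df.sort_values("impwealth")
running = ordered["indweight"].cumsum()
cutoff = ordered["indweight"].sum() / 2.0

# Show each value next to the cumulative weight reached at that row.
print(pd.DataFrame({"impwealth": ordered["impwealth"], "cumweight": running}))
print("cutoff:", cutoff)

# The first value whose cumulative weight reaches half the total weight.
median = ordered.loc[running >= cutoff, "impwealth"].iloc[0]
print("weighted median:", median)
```

The running total first reaches the cutoff (about 218.92) at the row with impwealth 925000, which is why that value comes out as the median.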
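Have you tried the wquantiles package? I have never used it before, but it has a weighted median function that seems to give at least a reasonable answer (you will probably want to double-check that it uses the method you expect).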
In [12]: import weighted
In [13]: weighted.median(df['impwealth'], df['indweight'])
Out[13]: 914662.0859091772
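One way to double-check the method is to reproduce the number by hand. The sketch below is my assumption about what an interpolating weighted median does, not a statement about the package's actual source: it places each sorted value at the midpoint of its weight on the cumulative-weight axis and interpolates linearly at 0.5. It uses only NumPy and the data from the question.

```python
import numpy as np

# Sample data copied from the question.
impwealth = np.array([180000, 384000, 342000, 1154000, 421300,
                      1210000, 1062500, 1878000, 876000, 925000], dtype=float)
indweight = np.array([34.200, 37.800, 39.715, 44.375, 44.375,
                      45.295, 45.295, 46.653, 46.653, 53.476])

# Sort values and carry the weights along.
order = np.argsort(impwealth)
v = impwealth[order]
w = indweight[order]

# Position of each observation in [0, 1]: the midpoint of its own weight
# on the normalized cumulative-weight axis (assumed interpolation scheme).
position = (np.cumsum(w) - 0.5 * w) / w.sum()

# Linear interpolation between the two values that straddle 0.5.
interpolated = np.interp(0.5, position, v)
print(interpolated)
```

With this scheme the result falls between 876000 and 925000 and agrees with Out[13] to within rounding, which suggests (but does not prove) that the package interpolates in this way rather than returning an observed value.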
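This function generalizes the proofreader's solution:
def weighted_median(df, val, weight):
    # Sort a copy by the value column so the caller's frame is left untouched
    df_sorted = df.sort_values(val)
    cumsum = df_sorted[weight].cumsum()
    cutoff = df_sorted[weight].sum() / 2.
    # First value whose cumulative weight reaches half of the total weight
    return df_sorted[cumsum >= cutoff][val].iloc[0]
In this example, it would be weighted_median(df, 'impwealth', 'indweight').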
You can use this solution for weighted percentiles with NumPy:
import numpy as np

def weighted_quantile(values, quantiles, sample_weight=None,
                      values_sorted=False, old_style=False):
    """ Very close to numpy.percentile, but supports weights.
    NOTE: quantiles should be in [0, 1]!
    :param values: numpy.array with data
    :param quantiles: array-like with many quantiles needed
    :param sample_weight: array-like of the same length as `values`
    :param values_sorted: bool, if True, then will avoid sorting of
        initial array
    :param old_style: if True, will correct output to be consistent
        with numpy.percentile.
    :return: numpy.array with computed quantiles.
    """
    values = np.array(values)
    quantiles = np.array(quantiles)
    if sample_weight is None:
        sample_weight = np.ones(len(values))
    sample_weight = np.array(sample_weight)
    assert np.all(quantiles >= 0) and np.all(quantiles <= 1), \
        'quantiles should be in [0, 1]'
    if not values_sorted:
        sorter = np.argsort(values)
        values = values[sorter]
        sample_weight = sample_weight[sorter]
    # Midpoint of each observation's weight on the cumulative-weight axis
    weighted_quantiles = np.cumsum(sample_weight) - 0.5 * sample_weight
    if old_style:
        # To be convenient with numpy.percentile
        weighted_quantiles -= weighted_quantiles[0]
        weighted_quantiles /= weighted_quantiles[-1]
    else:
        weighted_quantiles /= np.sum(sample_weight)
    return np.interp(quantiles, weighted_quantiles, values)
Call it as weighted_quantile(df.impwealth, quantiles=0.5, sample_weight=df.indweight).
You can also use this function, which I wrote for the same purpose.
Note: weighted uses interpolation at the end to select the 0.5 quantile (you can look at the code yourself).
The function I wrote just returns the value at the boundary where the cumulative weight reaches 0.5.
import numpy as np

def weighted_median(values, weights):
    """Compute the weighted median of a list of values.

    The weighted median is computed as follows:
    1- sort both lists (values and weights) based on values.
    2- select the 0.5 point from the weights and return the corresponding
       value as the result.
    e.g. values = [1, 3, 0] and weights = [0.1, 0.3, 0.6], assuming the
    weights are probabilities. Sorted values = [0, 1, 3] and the
    corresponding sorted weights = [0.6, 0.1, 0.3]; the 0.5 point on the
    weights corresponds to the first item, which is 0, so the weighted
    median is 0.
    """
    # convert the weights into probabilities
    sum_weights = sum(weights)
    weights = np.array([(w * 1.0) / sum_weights for w in weights])
    # sort values and weights based on values
    values = np.array(values)
    sorted_indices = np.argsort(values)
    values_sorted = values[sorted_indices]
    weights_sorted = weights[sorted_indices]
    # select the median point
    it = np.nditer(weights_sorted, flags=['f_index'])
    accumulative_probability = 0
    median_index = -1
    while not it.finished:
        accumulative_probability += it[0]
        if accumulative_probability > 0.5:
            median_index = it.index
            return values_sorted[median_index]
        elif accumulative_probability == 0.5:
            # exact tie: average this value and the next one
            median_index = it.index
            it.iternext()
            next_median_index = it.index
            return np.mean(values_sorted[[median_index, next_median_index]])
        it.iternext()
    return values_sorted[median_index]

# compare the weighted_median function and np.median
print(weighted_median([1, 3, 0, 7], [2, 3, 3, 9]))
print(np.median([1, 1, 0, 0, 0, 3, 3, 3, 7, 7, 7, 7, 7, 7, 7, 7, 7]))
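The comparison above relies on the fact that, for integer frequency weights, a weighted median should agree with the ordinary median of the data with each value repeated as many times as its weight. The sketch below checks this with a vectorized variant of the same idea; weighted_median_ties is a hypothetical name of mine, and the code assumes strictly positive weights. It also covers the exact-tie case, where the cumulative weight lands exactly on half of the total and the two neighbouring values are averaged.

```python
import numpy as np

def weighted_median_ties(values, weights):
    """Weighted median that averages the two middle values on an exact tie.

    Illustrative sketch only; assumes all weights are strictly positive.
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w)
    half = cum[-1] / 2.0
    # First index whose cumulative weight is >= half of the total weight.
    i = np.searchsorted(cum, half)
    # On an exact tie the median lies between v[i] and v[i + 1].
    if np.isclose(cum[i], half) and i + 1 < len(v):
        return 0.5 * (v[i] + v[i + 1])
    return v[i]

# Integer weights: compare against np.median on the expanded data.
vals, freq = [1, 3, 0, 7], [2, 3, 3, 9]
no_tie = weighted_median_ties(vals, freq)
no_tie_expected = np.median(np.repeat(vals, freq))

# Equal weights with an even count produce an exact tie at half the weight.
tie = weighted_median_ties([1, 2, 3, 4], [1, 1, 1, 1])
tie_expected = np.median([1, 2, 3, 4])

print(no_tie, no_tie_expected)
print(tie, tie_expected)
```

A method that only takes the first value whose cumulative weight reaches the cutoff would return 2 in the tie case, whereas np.median returns 2.5, so it is worth deciding which convention you want before comparing results across libraries.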
You can also compute the weighted median with the robustats library:
import numpy as np
import robustats # pip install robustats
# Weighted Median
x = np.array([1.1, 5.3, 3.7, 2.1, 7.0, 9.9])
weights = np.array([1.1, 0.4, 2.1, 3.5, 1.2, 0.8])
weighted_median = robustats.weighted_median(x, weights)
print("The weighted median is {}".format(weighted_median))
There is a weightedstats package, available through conda and pip, that implements weighted_median.
Assuming you are using conda, from a terminal (Mac/Linux) or the Anaconda Prompt (Windows):
conda activate YOURENVIRONMENT
conda install -c conda-forge -y weightedstats
(-y means "don't ask me to confirm the changes, just do it")
Then in your Python code:
import pandas as pd
import weightedstats as ws
df = pd.read_csv('/your/data/file.csv')
ws.weighted_median(df['values_col'], df['weights_col'])
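I'm not sure whether it works in all cases, but I compared some simple data against the weightedMedian() function from the R package matrixStats, and both gave the same result.
P.S. By the way, weightedstats can also compute weighted_mean(), although NumPy can do that as well: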
np.average(df['values_col'], weights=df['weights_col'])
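As a quick sanity check on what np.average computes with weights, the sketch below compares it with the textbook formula sum(w * x) / sum(w) on a small made-up array. The numbers are illustrative only and are not taken from the question's data.

```python
import numpy as np

# Small made-up example: the value 4.0 counts twice as much as the others.
x = np.array([1.0, 2.0, 4.0])
w = np.array([1.0, 1.0, 2.0])

# Weighted mean written out by hand: (1*1 + 2*1 + 4*2) / (1 + 1 + 2).
manual = (x * w).sum() / w.sum()

# The same quantity through NumPy's built-in.
builtin = np.average(x, weights=w)

print(manual, builtin)
```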