Percentiles from counts of values

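I want to compute percentiles from a collection of several large vectors in Python. Instead of concatenating the vectors and passing the resulting huge vector to numpy.percentile, is there a more efficient way?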

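My idea is to first count the frequencies of the distinct values (e.g. using scipy.stats.itemfreq), then combine these item frequencies across the vectors, and finally compute the percentiles from the counts.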

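Unfortunately, I have not been able to find functions for combining frequency tables (which is not trivial, since different tables may cover different items) or for computing percentiles from an item frequency table. Do I need to implement these myself, or can I use existing Python functions? If so, which ones?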

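Following Julien Palard's suggestion, I use collections.Counter for the first problem (computing and combining the frequency tables), together with my own implementation for the second problem (computing percentiles from a frequency table):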

from collections import Counter


def calc_percentiles(cnts_dict, percentiles_to_calc=range(101)):
    """Returns [(percentile, value)] with nearest rank percentiles.
    Percentile 0: <min_value>, 100: <max_value>.
    cnts_dict: { <value>: <count> }
    percentiles_to_calc: iterable for percentiles to calculate; 0 <= ~ <= 100
    """
    # materialize once so that a generator argument is not consumed by the assert
    percentiles_to_calc = sorted(percentiles_to_calc)
    assert all(0 <= p <= 100 for p in percentiles_to_calc)
    percentiles = []
    num = sum(cnts_dict.values())
    cnts = sorted(cnts_dict.items())  # [(value, count)] ordered by value
    curr_cnts_pos = 0  # current position in cnts
    curr_pos = cnts[0][1]  # sum of counts up to and including curr_cnts_pos
    for p in percentiles_to_calc:
        if p < 100:
            percentile_pos = p / 100.0 * num
            # advance to the first value whose cumulative count exceeds the target;
            # the bound is len(cnts) - 1 so the index below cannot run off the end
            while curr_pos <= percentile_pos and curr_cnts_pos < len(cnts) - 1:
                curr_cnts_pos += 1
                curr_pos += cnts[curr_cnts_pos][1]
            percentiles.append((p, cnts[curr_cnts_pos][0]))
        else:
            percentiles.append((p, cnts[-1][0]))  # we could add a small value
    return percentiles


# segment_iterator is assumed to be defined elsewhere:
# any iterable that yields the vectors one at a time
cnts_dict = Counter()
for segment in segment_iterator:
    cnts_dict.update(segment)  # adds this segment's counts in place
percentiles = calc_percentiles(cnts_dict)
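
If the vectors are already NumPy arrays, the same counting idea can be sketched with np.unique, cumsum and searchsorted. This is only a sketch under assumptions: merge_counts and percentiles_from_counts are names I made up for illustration, the sample vectors are invented, and np.unique(..., return_counts=True) stands in for scipy.stats.itemfreq (which, as far as I know, has since been removed from SciPy). It follows the same convention as the loop above: the result is the first value whose cumulative count exceeds p% of the total.

```python
import numpy as np


def merge_counts(vectors):
    """Merge per-vector frequency tables into one (values, counts) pair."""
    all_vals, all_cnts = [], []
    for vec in vectors:
        # frequency table of a single vector
        vals, cnts = np.unique(vec, return_counts=True)
        all_vals.append(vals)
        all_cnts.append(cnts)
    all_vals = np.concatenate(all_vals)
    all_cnts = np.concatenate(all_cnts)
    # tables may cover different values, so group equal values and add their counts
    values, inverse = np.unique(all_vals, return_inverse=True)
    counts = np.bincount(inverse, weights=all_cnts).astype(np.int64)
    return values, counts


def percentiles_from_counts(values, counts, qs):
    """Return, for each q in qs (0..100), the first value whose
    cumulative count exceeds q% of the total count."""
    cum = np.cumsum(counts)
    pos = np.asarray(qs, dtype=float) / 100.0 * cum[-1]
    idx = np.searchsorted(cum, pos, side="right")
    # q = 100 would point one past the end; map it to the maximum value
    idx = np.minimum(idx, len(values) - 1)
    return values[idx]


# made-up example data
vectors = [np.array([1, 2, 2, 3]), np.array([2, 3, 3, 5]), np.array([5, 1])]
values, counts = merge_counts(vectors)
result = percentiles_from_counts(values, counts, [0, 25, 50, 75, 100])
print(values, counts)  # [1 2 3 5] [2 3 3 2]
print(result)          # [1 2 3 3 5]
```

Only the per-vector frequency tables are held in memory here, so this pays off when the number of distinct values is much smaller than the total number of elements.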

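The same problem has bugged me for a long time, so I decided to put some effort into it. The idea is to reuse something from scipy.stats, so that cdf and ppf come for free.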

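There is a class rv_discrete that is meant for subclassing. Browsing the source code for similar subclasses, I found rv_sample, which has an interesting description: A 'sample' discrete distribution defined by the support and values. This class is not exposed in the API, but it is what gets used when you pass values directly to rv_discrete.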

So here is a possible solution:

import numpy as np
import scipy.stats
# some mapping from numeric values to the frequencies
freqs = np.array([
    [1, 3],
    [2, 10],
    [3, 13],
    [4, 12],
    [5, 9],
    [6, 4],
])
def distrib_from_freqs(arr: np.ndarray) -> scipy.stats.rv_discrete:
    pmf = arr[:, 1] / arr[:, 1].sum()
    distrib = scipy.stats.rv_discrete(values=(arr[:, 0], pmf))
    return distrib
distrib = distrib_from_freqs(freqs)
print(distrib.pmf(freqs[:, 0]))
print(distrib.cdf(freqs[:, 0]))
print(distrib.ppf(distrib.cdf(freqs[:, 0])))  # percentiles
# [0.05882353 0.19607843 0.25490196 0.23529412 0.17647059 0.07843137]
# [0.05882353 0.25490196 0.50980392 0.74509804 0.92156863 1.        ]
# [1. 2. 3. 4. 5. 6.]
# max, median, 1st quartile, 3rd quartile
print(distrib.ppf([1.0, 0.5, 0.25, 0.75]))
# [6. 3. 2. 5.]
# ppf is defined for probabilities in (0, 1];
#   q = 0 returns the value right before the minimum (min - 1):
print(distrib.ppf(0))
# 0.0
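
To connect this with the Counter approach from the other answer, here is a minimal sketch that turns merged counts into such a distribution. distrib_from_counter is a hypothetical helper name and the segments are made-up data. Note that the SciPy documentation describes the support values (xk) as integers, so treat non-integer values as untested here.

```python
from collections import Counter

import numpy as np
import scipy.stats


def distrib_from_counter(cnts):
    """Build a discrete distribution from a {value: count} mapping."""
    items = sorted(cnts.items())  # order the support by value
    values = np.array([v for v, _ in items])
    counts = np.array([c for _, c in items], dtype=float)
    pmf = counts / counts.sum()  # probabilities must sum to 1
    return scipy.stats.rv_discrete(values=(values, pmf))


# made-up example data: count each segment separately, then merge
segments = [[1, 2, 2, 3], [2, 3, 3, 5], [5, 1]]
cnts = Counter()
for segment in segments:
    cnts.update(segment)

distrib = distrib_from_counter(cnts)
quantiles = distrib.ppf([0.25, 0.6, 0.9])
print(quantiles)  # [2. 3. 5.]
```

ppf returns the smallest value whose cdf is at least q, so at probabilities that fall exactly on a cdf step it can differ from a loop that looks for the first cumulative count strictly greater than the target.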
