根据样本数据计算置信区间



我有样本数据,我想计算一个置信区间,假设呈正态分布。

我已经找到并安装了 numpy 和 scipy 软件包,并让 numpy 返回平均值和标准偏差(numpy.mean(data(,数据是一个列表(。任何关于获得样本置信区间的建议将不胜感激。

import numpy as np
import scipy.stats
def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n-1)
    return m, m-h, m+h

你可以这样计算。

这里是 shasan 代码的缩短版本,计算数组 a 平均值的 95% 置信区间:

import numpy as np, scipy.stats as st
st.t.interval(0.95, len(a)-1, loc=np.mean(a), scale=st.sem(a))

但是使用StatsModels的tconfint_mean可以说更好:

import statsmodels.stats.api as sms
sms.DescrStatsW(a).tconfint_mean()

两者的基本假设是样本(数组a(是独立于具有未知标准差的正态分布绘制的(参见MathWorld或Wikipedia(。

对于大样本量 n,样本均值呈正态分布,可以使用 st.norm.interval() 计算其置信区间(如 Jaime 的评论中所建议的(。 但上述解决方案对于小 n 也是正确的,其中 st.norm.interval() 给出的置信区间太窄(即"假置信"(。 有关更多详细信息,请参阅我对类似问题的回答(以及Russ的评论之一(。

下面是一个示例,其中正确的选项给出(基本上(相同的置信区间:

In [9]: a = range(10,14)
In [10]: mean_confidence_interval(a)
Out[10]: (11.5, 9.4457397432391215, 13.554260256760879)
In [11]: st.t.interval(0.95, len(a)-1, loc=np.mean(a), scale=st.sem(a))
Out[11]: (9.4457397432391215, 13.554260256760879)
In [12]: sms.DescrStatsW(a).tconfint_mean()
Out[12]: (9.4457397432391197, 13.55426025676088)

最后,使用st.norm.interval()的错误结果:

In [13]: st.norm.interval(0.95, loc=np.mean(a), scale=st.sem(a))
Out[13]: (10.23484868811834, 12.76515131188166)

Python 3.8 开始,标准库提供 NormalDist 对象作为statistics模块的一部分:

from statistics import NormalDist
def confidence_interval(data, confidence=0.95):
  dist = NormalDist.from_samples(data)
  z = NormalDist().inv_cdf((1 + confidence) / 2.)
  h = dist.stdev * z / ((len(data) - 1) ** .5)
  return dist.mean - h, dist.mean + h

这:

  • 从数据样本(NormalDist.from_samples(data)(创建一个NormalDist对象,这使我们能够通过NormalDist.meanNormalDist.stdev访问样本的平均值和标准差。

  • 使用累积分布函数的反函数 (inv_cdf ( 根据给定置信度的标准正态分布(用 NormalDist() 表示(计算Z-score

  • 根据样本的标准差和平均值生成置信区间。


这假设样本量足够大(假设超过 ~100 个点(,以便使用标准正态分布而不是学生的 t 分布来计算z值。

首先从查找表中查找所需置信区间的 z 值。 然后mean +/- z*sigma置信区间,其中sigma是样本均值的估计标准差,由sigma = s / sqrt(n)给出,其中s是根据样本数据计算的标准差,n是样本数量。

基于原始内容,但有一些具体的例子:

import numpy as np
def mean_confidence_interval(data, confidence: float = 0.95) -> tuple[float, np.ndarray]:
    """
    Returns (tuple of) the mean and confidence interval for given data.
    Data is a np.arrayable iterable.
    ref:
        - https://stackoverflow.com/a/15034143/1601580
        - https://github.com/WangYueFt/rfs/blob/f8c837ba93c62dd0ac68a2f4019c619aa86b8421/eval/meta_eval.py#L19
    """
    import scipy.stats
    import numpy as np
    a: np.ndarray = 1.0 * np.array(data)
    n: int = len(a)
    if n == 1:
        import logging
        logging.warning('The first dimension of your data is 1, perhaps you meant to transpose your data? or remove the'
                        'singleton dimension?')
    m, se = a.mean(), scipy.stats.sem(a)
    tp = scipy.stats.t.ppf((1 + confidence) / 2., n - 1)
    h = se * tp
    return m, h
def ci_test_float():
    import numpy as np
    # - one WRONG data set of size 1 by N
    data = np.random.randn(1, 30)  # gives an error becuase len sets n=1, so not this shape!
    m, ci = mean_confidence_interval(data)
    print('-- you should get a mean and a list of nan ci (since data is in wrong format, it thinks its 30 data sets of '
          'length 1.')
    print(m, ci)
    # right data as N by 1
    data = np.random.randn(30, 1)
    m, ci = mean_confidence_interval(data)
    print('-- gives a mean and a list of length 1 for a single CI (since it thinks you have a single dat aset)')
    print(m, ci)
    # multiple data sets (7) of size N (=30)
    data = np.random.randn(30, 7)
    print('-- gives 7 CIs for the 7 data sets of length 30. 30 is the number ud want large if you were using z(p)'
          'due to the CLT.')
    m, ci = mean_confidence_interval(data)
    print(m, ci)
ci_test_float()

输出:

-- you should get a mean and a list of nan ci (since data is in wrong format, it thinks its 30 data sets of length 1.
0.1431623130952463 [nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan]
-- gives a mean and a list of length 1 for a single CI (since it thinks you have a single dat aset)
0.04947206018132864 [0.40627264]
-- gives 7 CIs for the 7 data sets of length 30. 30 is the number ud want large if you were using z(p)due to the CLT.
-0.03585104402718902 [0.31867309 0.35619134 0.34860011 0.3812853  0.44334033 0.35841138
 0.40739732]

我认为Num_datasets Num_samples是正确的,但如果评论部分没有让我知道。

<小时 />

它适用于什么类型的数据?

我认为它可以用于任何数据,因为以下原因:

我相信这很好,因为平均值和 std 是针对一般数值数据计算的,而 z_p/t_p 值只考虑置信区间和数据大小,因此它与数据分布的假设无关。

所以我相信它可以用于回归和分类。

<小时 />

作为奖励,一个几乎只使用火炬的火炬实现:

def torch_compute_confidence_interval(data: Tensor,
                                      confidence: float = 0.95
                                      ) -> Tensor:
    """
    Computes the confidence interval for a given survey of a data set.
    """
    n: int = len(data)
    mean: Tensor = data.mean()
    # se: Tensor = scipy.stats.sem(data)  # compute standard error
    # se, mean: Tensor = torch.std_mean(data, unbiased=True)  # compute standard error
    se: Tensor = data.std(unbiased=True) / (n ** 0.5)
    t_p: float = float(scipy.stats.t.ppf((1 + confidence) / 2., n - 1))
    ci = t_p * se
    return mean, ci
<小时 />

关于CI的一些评论(或见 https://stats.stackexchange.com/questions/554332/confidence-interval-given-the-population-mean-and-standard-deviation?noredirect=1&lq=1(:

"""
Review for confidence intervals. Confidence intervals say that the true mean is inside the estimated confidence interval
(the r.v. the user generates). In particular it says:
    Pr[mu^* in [mu_n +- t.val(p) * std_n / sqrt(n) ] ] >= p
e.g. p = 0.95
This does not say that for a specific CI you compute the true mean is in that interval with prob 0.95. Instead it means
that if you surveyed/sampled 100 data sets D_n = {x_i}^n_{i=1} of size n (where n is ideally >=30) then for 95 of those
you'd expect to have the truee mean inside the CI compute for that current data set. Note you can never check for which
ones mu^* is in the CI since mu^* is unknown. If you knew mu^* you wouldn't need to estimate it. This analysis assumes
that the the estimator/value your estimating is the true mean using the sample mean (estimator). Since it usually uses
the t.val or z.val (second for the standardozed r.v. of a normal) then it means the approximation that mu_n ~ gaussian
must hold. This is most likely true if n >= 0. Note this is similar to statistical learning theory where we use
the MLE/ERM estimator to choose a function with delta, gamma etc reasoning. Note that if you do algebra you can also
say that the sample mean is in that interval but wrt mu^* but that is borning, no one cares since you do not know mu^*
so it's not helpful.
An example use could be for computing the CI of the loss (e.g. 0-1, CE loss, etc). The mu^* you want is the expected
risk. So x_i = loss(f(x_i), y_i) and you are computing the CI for what is the true expected risk for that specific loss
function you choose. So mu_n = emperical mean of the loss and std_n = (unbiased) estimate of the std and then you can
simply plug in the values.
Assumptions for p-CI:
    - we are making a statement that mu^* is in mu+-pCI = mu+-t_p * sig_n / sqrt n, sig_n ~ Var[x] is inside the CI
    p% of the time.
    - we are estimating mu^, a mean
    - since the quantity of interest is mu^, then the z_p value (or p-value, depending which one is the unknown), is
    computed using the normal distribution.
    - p(mu) ~ N(mu; mu_n, sig_n/ sqrt n), vial CTL which holds for sample means. Ideally n >= 30.
    - x ~ p^*(x) are iid.
Std_n vs t_p*std_n/ sqrt(n)
    - std_n = var(x) is more pessimistic but holds always. Never shrinks as n->infity
    - but if n is small then pCI might be too small and your "lying to yourself". So if you have very small data
    perhaps doing std_n for the CI is better. That holds with prob 99.9%. Hopefuly std is not too large for your
    experiments to be invalidated.
ref:
    - https://stats.stackexchange.com/questions/554332/confidence-interval-given-the-population-mean-and-standard-deviation?noredirect=1&lq=1
    - https://stackoverflow.com/questions/70356922/what-is-the-proper-way-to-compute-95-confidence-intervals-with-pytorch-for-clas
    - https://www.youtube.com/watch?v=MzvRQFYUEFU&list=PLUl4u3cNGP60hI9ATjSFgLZpbNJ7myAg6&index=205
"""

关于乌尔里希的答案 - 即使用 t 值。当真实方差未知时,我们使用它。此时,您拥有的唯一数据是示例数据。

对于Bogatron的答案,这涉及z表。当方差已知且已提供时,将使用 z 表。然后,您还有示例数据。西格玛不是样本均值的估计标准差。这是众所周知的。

假设方差是已知的,我们需要 95% 的置信度:

from scipy.stats import norm
alpha = 0.95
# Define our z
ci = alpha + (1-alpha)/2
#Lower Interval, where n is sample siz
c_lb = sample_mean - norm.ppf(ci)*((sigma/(n**0.5)))
c_ub = sample_mean + norm.ppf(ci)*((sigma/(n**0.5)))

对于只有样本数据和未知方差(意味着方差必须仅根据样本数据计算(,Ulrich 的答案非常有效。但是,您可能希望指定置信区间。如果数据为 a,并且希望置信区间为 0.95:

import statsmodels.stats.api as sms
conf = sms.DescrStatsW(a).tconfint_mean(alpha=0.05)
conf

最新更新