How to perform a KS test in Python

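scipy.stats.kstest(rvs, cdf, N) can perform a KS test on a dataset rvs. It tests whether the dataset follows a probability distribution whose cdf is specified in the parameters of this method.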

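Now consider a dataset of N=4800 samples. I have performed a KDE on this data and therefore have an estimated PDF. This PDF looks very much like a bimodal distribution. When I plot the estimated PDF together with a bimodal distribution fitted to it by curve fitting, the two plots are nearly identical. The parameters of the fitted bimodal distribution are (scale1, mean1, stdv1, scale2, mean2, stdv2): [0.6 0.036 0.52, 0.23 1.25 0.4]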

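How can I apply scipy.stats.kstest to test whether my estimated PDF is bimodally distributed? As my null hypothesis, I state that the estimated PDF equals the following PDF: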

hypoDist = 0.6*norm(loc=0, scale=0.2).pdf(x_grid) + 0.3*norm(loc=1, scale=0.2).pdf(x_grid)
hypoCdf = np.cumsum(hypoDist)/len(x_grid)

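x_grid is just a vector containing the x-values at which I evaluate my estimated PDF, so each entry of pdf has a corresponding value in x_grid. My calculation of hypoCdf may be incorrect. Maybe, instead of dividing by len(x_grid), I should divide by np.sum(hypoDist)?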

The challenge: the cdf parameter of kstest cannot be specified as bimodal, and I cannot pass hypoDist for it either.

If I wanted to test whether my dataset is Gaussian distributed, I would write:

KS_result = kstest(measurementError, norm(loc=mean(pdf), scale=np.std(pdf)).cdf)
print(KS_result)

measurementError is the dataset on which I performed the KDE. This returns: statistic=0.459, pvalue=0.0. I find the p-value of 0.0 a bit irritating.

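The cdf parameter of kstest can be a callable that implements the cumulative distribution function of the distribution against which you want to test your data. To use it, you have to implement the CDF of your bimodal distribution. You want the distribution to be a mixture of two normal distributions. You can implement the CDF of this distribution by computing the weighted sum of the CDFs of the two normal distributions that make up the mixture.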
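
Written out, the weighted sum looks as follows. The symbols are notation introduced here for illustration: $\Phi$ is the standard normal CDF, $w$ is the weight of the first component, and $\mu_i$, $\sigma_i$ are the mean and standard deviation of component $i$.

```latex
F(x) = w\,\Phi\!\left(\frac{x - \mu_1}{\sigma_1}\right)
     + (1 - w)\,\Phi\!\left(\frac{x - \mu_2}{\sigma_2}\right),
\qquad 0 \le w \le 1
```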

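Here is a script that shows how you can do this. To demonstrate how to use kstest, the script runs kstest twice. First it uses a sample that is not from the distribution. As expected, kstest computes a very small p-value for this first sample. It then generates a sample that is drawn from the mixture. For this sample, the p-value is not small.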

import numpy as np
from scipy import stats

def bimodal_cdf(x, weight1, mean1, stdv1, mean2, stdv2):
    """
    CDF of a mixture of two normal distributions.
    """
    return (weight1*stats.norm.cdf(x, mean1, stdv1) +
            (1 - weight1)*stats.norm.cdf(x, mean2, stdv2))

# We only need weight1, since weight2 = 1 - weight1.
weight1 = 0.6
mean1 = 0.036
stdv1 = 0.52
mean2 = 1.25
stdv2 = 0.4
n = 200
# Create a sample from a regular normal distribution that has parameters
# similar to the bimodal distribution.
sample1 = stats.norm.rvs(0.5*(mean1 + mean2), 0.5, size=n)
# The result of kstest should show that sample1 is not from the bimodal
# distribution (i.e. the p-value should be very small).
stat1, pvalue1 = stats.kstest(sample1, cdf=bimodal_cdf,
                              args=(weight1, mean1, stdv1, mean2, stdv2))
print("sample1 p-value =", pvalue1)
# Create a sample from the bimodal distribution.  This sample is the
# concatenation of samples from the two normal distributions that make
# up the bimodal distribution.  The number of samples to take from the
# first distribution is determined by a binomial distribution of n
# samples with probability weight1.
n1 = np.random.binomial(n, p=weight1)
sample2 = np.concatenate((stats.norm.rvs(mean1, stdv1, size=n1),
                         (stats.norm.rvs(mean2, stdv2, size=n - n1))))
# Most of the time, the p-value returned by kstest with sample2 will not
# be small.  We expect the value to be uniformly distributed in the interval
# [0, 1], so in general it will not be very small.
stat2, pvalue2 = stats.kstest(sample2, cdf=bimodal_cdf,
                              args=(weight1, mean1, stdv1, mean2, stdv2))
print("sample2 p-value =", pvalue2)

Typical output (the numbers will be different each time the script is run):

sample1 p-value = 2.8395166853884146e-11
sample2 p-value = 0.3289374831186403
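
On the hypoCdf computation in the question: np.cumsum(hypoDist)/len(x_grid) approximates the integral only if the grid happens to span an interval of length about 1, and the weights 0.6 and 0.3 sum to 0.9, so hypoDist is not a normalized PDF. Below is a minimal sketch of one alternative, assuming a grid wide enough to cover essentially all of the probability mass: integrate the PDF with the trapezoid rule, rescale so that the last value is 1, and wrap the result in an interpolating callable that kstest accepts. The grid limits, the second weight of 0.4, the seed, and the names hypo_pdf, hypo_cdf and grid_cdf are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# Assumed grid; the question does not say what x_grid contains.
x_grid = np.linspace(-2.0, 3.0, 2001)

# Mixture weights should sum to 1 (0.4 is an assumption; the question uses 0.3).
w1, w2 = 0.6, 0.4
hypo_pdf = (w1 * stats.norm(loc=0, scale=0.2).pdf(x_grid)
            + w2 * stats.norm(loc=1, scale=0.2).pdf(x_grid))

# Cumulative trapezoid rule: area of each grid cell, then running sum.
cell_areas = 0.5 * (hypo_pdf[1:] + hypo_pdf[:-1]) * np.diff(x_grid)
hypo_cdf = np.concatenate(([0.0], np.cumsum(cell_areas)))
# Rescale so the CDF ends at exactly 1 on the grid.
hypo_cdf /= hypo_cdf[-1]

def grid_cdf(x):
    """Interpolate the tabulated CDF; 0 left of the grid, 1 right of it."""
    return np.interp(x, x_grid, hypo_cdf, left=0.0, right=1.0)

# Usage: draw a sample from the same mixture and test it against grid_cdf.
rng = np.random.default_rng(0)
n = 200
n1 = rng.binomial(n, w1)
sample = np.concatenate((rng.normal(0, 0.2, size=n1),
                         rng.normal(1, 0.2, size=n - n1)))
result = stats.kstest(sample, grid_cdf)
print("grid-based p-value =", result.pvalue)
```

When a closed-form CDF is available, as it is for a mixture of normals, it is simpler and more accurate than a tabulated one; the interpolated version is mainly useful when only a gridded PDF exists.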

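You might find that, for your problem, this test does not work well. You have 4800 samples, but in your code the parameters have numerical values with only one or two significant digits. Unless you have good reason to believe that your sample is drawn from a distribution with exactly those parameters, kstest is likely to return a very small p-value.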
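
A rough sketch of this effect, assuming the data really come from the fitted mixture: with 4800 samples, testing against a hypothesis whose parameters are somewhat off tends to produce a very small p-value, while the exact parameters do not. The mis-specified parameter values, the seed, and the variable names are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy import stats

def bimodal_cdf(x, weight1, mean1, stdv1, mean2, stdv2):
    """CDF of a mixture of two normal distributions."""
    return (weight1 * stats.norm.cdf(x, mean1, stdv1)
            + (1 - weight1) * stats.norm.cdf(x, mean2, stdv2))

rng = np.random.default_rng(12345)
n = 4800

# Parameters used to generate the data (taken from the fit in the question).
true_params = (0.6, 0.036, 0.52, 1.25, 0.4)
# Hypothetical mis-specified hypothesis: only mean1 is changed.
off_params = (0.6, 0.2, 0.52, 1.25, 0.4)

# Draw the sample from the true mixture.
weight1, mean1, stdv1, mean2, stdv2 = true_params
n1 = rng.binomial(n, weight1)
sample = np.concatenate((rng.normal(mean1, stdv1, size=n1),
                         rng.normal(mean2, stdv2, size=n - n1)))

res_true = stats.kstest(sample, bimodal_cdf, args=true_params)
res_off = stats.kstest(sample, bimodal_cdf, args=off_params)
print("exact parameters:         p-value =", res_true.pvalue)
print("mis-specified parameters: p-value =", res_off.pvalue)
```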
