Python如何使用n、min、max、mean、std、25%、50%、75%、Skew、峰度来定义psudo随机概率



在阅读和尝试numpy.random时,我似乎找不到或创建不到我需要的东西;10参数Python伪随机值生成器,包括count、min、max、mean、sd、25%ile、50%ile(中位数(、75th%ile、偏斜和峰度。

发件人https://docs.python.org/3/library/random.html我看到这些分布是均匀的、正态(高斯(、对数正态、负指数、伽马和贝塔分布,尽管我需要直接生成仅由我的10个参数定义的分布值,而不参考分布族。

是否有numpy.srandom.xxxxxxx(n,min,max,mean,sd,25%,50%,75%,偏斜,峰度(的文档或作者,或者我可能修改以实现此目标的最接近的现有源代码是什么?

这在某种程度上与describe((相反,包括偏斜和峰度。我可以进行循环或优化,直到随机生成的数字满足标准,尽管这可能需要无限长的时间才能满足我的10个参数。

我在R中找到了optim,它生成了一个数据集,但到目前为止,我已经能够增加R optim源代码中的参数,或者用Python scipy.optimize或类似的方法复制它,尽管这些仍然取决于方法,而不是直接根据我需要的10个参数随机创建数据集;

m0 <- 20
sd0 <- 5
min <- 1
max <- 45
n <- 15
set.seed(1)
mm <- min:max
x0 <- sample(mm, size=n, replace=TRUE)
objfun <- function(x) {(mean(x)-m0)^2+(sd(x)-sd0)^2}
candfun <- function(x) {x[sample(n, size=1)] <- sample(mm, size=1)
return(x)}
objfun(x0) ##INITIAL RESULT:83.93495
o1 <- optim(par=x0, fn=objfun, gr=candfun, method="SANN", control=list(maxit=1e6))
mean(o1$par) ##INITIAL RESULT:20
sd(o1$par) ##INITIAL RESULT:5
plot(table(o1$par))

根据分布生成随机数的最常见方法如下:

  • 生成一个以0和1为界的均匀随机数(例如numpy.random.random()(
  • 取该数字的逆CDF(逆累积分布函数(

结果是一个遵循分布的数字。

在您的情况下,反向CDF(ICDF(x)(已经由您的五个参数确定——最小、最大和三个百分位数,如下所示:

  • ICDF(0(=最小值
  • ICDF(0.25(=第25百分位
  • ICDF(0.5(=第50百分位
  • ICDF(0.75(=第75百分位
  • ICDF(1(=最大值

因此,您已经对反向CDF的外观有了一些了解。现在所要做的就是以某种方式优化其他参数(均值、标准差、偏度和峰度(的反向CDF。例如,您可以在其他百分位数"填充"反向CDF,并查看它们与您要追求的参数的匹配程度。从这个意义上说,一个好的开始猜测是刚才提到的百分位数的线性插值。另一件需要记住的事情是反向CDF"永远不会下降"。


下面的代码显示了一个解决方案。它执行以下步骤:

  • 它通过线性插值计算反向CDF的初始猜测。最初的猜测由101个等距点的函数值组成,包括上面提到的五个百分位数
  • 它设置了优化的边界。优化受除五个百分位数以外的所有位置的最小值和最大值的限制
  • 它设置了其他四个参数
  • 然后,它将目标函数(_lossfunc(、初始猜测、边界和其他参数传递给SciPy的scipy.optimize.minimize方法进行优化
  • 一旦优化完成,代码将检查是否成功,如果不成功,则会引发错误
  • 如果优化成功,代码将为最终结果计算反向CDF
  • 它生成N个均匀的随机值
  • 它使用反向CDF转换这些值并返回这些值
import scipy.stats.mstats as mst
from scipy.optimize import minimize
from scipy.interpolate import interp1d
import numpy
# Define the loss function, which compares the calculated
# and ideal parameters
def _lossfunc(x, *args):
mean, stdev, skew, kurt, chunks = args
st = (
(numpy.mean(x) - mean) ** 2
+ (numpy.sqrt(numpy.var(x)) - stdev) ** 2
+ ((mst.skew(x) - skew)) ** 2
+ ((mst.kurtosis(x) - kurt)) ** 2
)
return st
def adjust(rx, percentiles):
eps = (max(rx) - min(rx)) / (3.0 * len(rx))
# Make result monotonic
for i in range(1, len(rx)):
if (
i - 2 >= 0
and rx[i - 2] < rx[i - 1]
and rx[i - 1] >= rx[i]
and rx[i - 2] < rx[i]
):
rx[i - 1] = (rx[i - 2] + rx[i]) / 2.0
elif rx[i - 1] >= rx[i]:
rx[i] = rx[i - 1] + eps
# Constrain to percentiles
for pi in range(1, len(percentiles)):
previ = percentiles[pi - 1][0]
prev = rx[previ]
curr = rx[percentiles[pi][0]]
prevideal = percentiles[pi - 1][1]
currideal = percentiles[pi][1]
realrange = max(eps, curr - prev)
idealrange = max(eps, currideal - prevideal)
for i in range(previ + 1, percentiles[pi][0]):
if rx[i] >= currideal or rx[i] <= prevideal:
rx[i] = (
prevideal
+ max(eps * (i - previ + 1 + 1), rx[i] - prev) * idealrange / realrange
)
rx[percentiles[pi][0]] = currideal
# Make monotonic again
for pi in range(1, len(percentiles)):
previ = percentiles[pi - 1][0]
curri = percentiles[pi][0]
for i in range(previ+1, curri+1):
if (
i - 2 >= 0
and rx[i - 2] < rx[i - 1]
and rx[i - 1] >= rx[i]
and rx[i - 2] < rx[i]
and i-1!=previ and i-1!=curri
):
rx[i - 1] = (rx[i - 2] + rx[i]) / 2.0
elif rx[i - 1] >= rx[i] and i!=curri:
rx[i] = rx[i - 1] + eps
return rx
# Calculates an inverse CDF for the given nine parameters.
def _get_inverse_cdf(mn, p25, p50, p75, mx, mean, stdev, skew, kurt, chunks=100):
if chunks < 0:
raise ValueError
# Minimum of 16 chunks
chunks = max(16, chunks)
# Round chunks up to closest multiple of 4
if chunks % 4 != 0:
chunks += 4 - (chunks % 4)
# Calculate initial guess for the inverse CDF; an
# interpolation of the inverse CDF through the known
# percentiles
interp = interp1d([0, 0.25, 0.5, 0.75, 1.0], [mn, p25, p50, p75, mx], kind="cubic")
rnge = mx - mn
x = interp(numpy.linspace(0, 1, chunks + 1))
# Bounds, taking percentiles into account
bounds = [(mn, mx) for i in range(chunks + 1)]
percentiles = [
[0, mn],
[int(chunks * 1 / 4), p25],
[int(chunks * 2 / 4), p50],
[int(chunks * 3 / 4), p75],
[int(chunks), mx],
]
for p in percentiles:
bounds[p[0]] = (p[1], p[1])
# Other parameters
otherParams = (mean, stdev, skew, kurt, chunks)
# Optimize the result for the given parameters
# using the initial guess and the bounds
result = minimize(
_lossfunc,  # Loss function
x,  # Initial guess
otherParams,  # Arguments
bounds=bounds,
)
rx = result.x
if result.success:
adjust(rx, percentiles)
# Minimize again
result = minimize(
_lossfunc,  # Loss function
rx,  # Initial guess
otherParams,  # Arguments
bounds=bounds,
)
rx = result.x
adjust(rx, percentiles)
# Minimize again
result = minimize(
_lossfunc,  # Loss function
rx,  # Initial guess
otherParams,  # Arguments
bounds=bounds,
)
rx = result.x
# Calculate interpolating function of result
ls = numpy.linspace(0, 1, chunks + 1)
success = result.success
icdf=interp1d(ls, rx, kind="linear")
# == To check the quality of the result
if False:
meandiff = numpy.mean(rx) - mean
stdevdiff = numpy.sqrt(numpy.var(rx)) - stdev
print(meandiff)
print(stdevdiff)
print(mst.skew(rx)-skew)
print(mst.kurtosis(rx)-kurt)
print(icdf(0)-percentiles[0][1])
print(icdf(0.25)-percentiles[1][1])
print(icdf(0.5)-percentiles[2][1])
print(icdf(0.75)-percentiles[3][1])
print(icdf(1)-percentiles[4][1])
return (icdf, success)
def random_10params(n, mn, p25, p50, p75, mx, mean, stdev, skew, kurt):
""" Note: Kurtosis as used here is Fisher's kurtosis, 
or kurtosis excess. Stdev is square root of numpy.var(). """
# Calculate inverse CDF
icdf, success = (None, False)
tries = 0
# Try up to 10 times to get a converging inverse CDF, increasing the mesh each time
chunks = 500
while tries < 10:
icdf, success = _get_inverse_cdf(mn, p25, p50, p75, mx, mean, stdev, skew, kurt,chunks=chunks)
tries+=1
chunks+=100
if success: break
if not success:
print("Warning: Estimation failed and may be inaccurate")
# Generate uniform random variables
npr=numpy.random.random(size=n)
# Transform them with the inverse CDF
return icdf(npr)

示例:

print(random_10params(n=1000, mn=39, p25=116, p50=147, p75=186, mx=401, mean=154.1207, stdev=52.3257, skew=.7083, kurt=.5383))

最后一点注意:如果您可以访问底层数据点,而不仅仅是它们的统计数据,您还可以使用其他方法从这些数据点形成的分布中进行采样。示例包括内核密度估计直方图回归模型(尤其是时间序列数据(。请参见基于现有数据生成随机数据。

最新更新