Calculate the confidence interval of the mean over all rows of a DataFrame in Spark / Scala



I need to compute the confidence interval of the mean of the value3 column, together with its upper and lower bounds, and I need to apply it to my whole DataFrame. This is my DataFrame:

+--------+---------+------+
|  value1| value2  |value3|
+--------+---------+------+
|   a    |  2      |   3  |
+--------+---------+------+
|   b    |  5      |   4  |
+--------+---------+------+
|   b    |  5      |   4  |
+--------+---------+------+
|   c    |  3      |   4  |
+--------+---------+------+ 

So my output should look like the following (where x is the computed value):

+--------+---------+------+-------+--------+----------+
|  value1| value2  |value3|max_int|min_int |    int   |
+--------+---------+------+-------+--------+----------+
|   a    |  2      |   3  |   x   |   x    |     x    |
+--------+---------+------+-------+--------+----------+
|   b    |  5      |   4  |   x   |   x    |     x    |
+--------+---------+------+-------+--------+----------+
|   b    |  5      |   4  |   x   |   x    |     x    |
+--------+---------+------+-------+--------+----------+
|   c    |  3      |   4  |   x   |   x    |     x    |
+--------+---------+------+-------+--------+----------+

Since I couldn't find a built-in function for this, I found the following function to do it. This is the code that computes it:

import org.apache.commons.math3.distribution.TDistribution
import org.apache.commons.math3.exception.MathIllegalArgumentException
import org.apache.commons.math3.stat.descriptive.SummaryStatistics

object ConfidenceIntervalApp {

  def main(args: Array[String]): Unit = {
    // my dataframe name is df
    // one way to fill the statistics: collect value3 (an integer column) to the driver
    val stats = new SummaryStatistics()
    df.select("value3").collect().foreach(row => stats.addValue(row.getInt(0).toDouble))

    // Calculate 95% confidence interval
    val ci: Double = calcMeanCI(stats, 0.95)
    println(s"Mean: ${stats.getMean}")
    val lower: Double = stats.getMean - ci
    val upper: Double = stats.getMean + ci
  }

  def calcMeanCI(stats: SummaryStatistics, level: Double): Double =
    try {
      // Create T Distribution with N-1 degrees of freedom
      val tDist: TDistribution = new TDistribution(stats.getN - 1)
      // Calculate critical value
      val critVal: Double =
        tDist.inverseCumulativeProbability(1.0 - (1 - level) / 2)
      // Half-width of the confidence interval
      critVal * stats.getStandardDeviation / Math.sqrt(stats.getN)
    } catch {
      case e: MathIllegalArgumentException => java.lang.Double.NaN
    }
}

Could you help me, or at least point me in the right direction, on how to apply this to the column? Thanks in advance.


You can do something like this:

import org.apache.spark.sql.functions.lit

// approximate count of value3 rows, with a 95% confidence interval
val cntInterval = df.select("value3").rdd.countApprox(timeout = 1000L, confidence = 0.95)
val (lowCnt, highCnt) = (cntInterval.getFinalValue().low, cntInterval.getFinalValue().high)

df.withColumn("max_int", lit(highCnt))
  .withColumn("min_int", lit(lowCnt))
  .withColumn("int", lit(cntInterval.getFinalValue().toString()))
  .show(false)

I took the idea from: In spark, how to quickly estimate the number of elements in a dataframe.
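
Note that countApprox gives a confidence interval on the approximate row count, not on the mean of value3. If you want the interval around the mean itself, a minimal sketch (reusing calcMeanCI from the code above, and assuming value3 is an integer column small enough to collect to the driver) could look like this:

import org.apache.commons.math3.stat.descriptive.SummaryStatistics
import org.apache.spark.sql.functions.lit

// gather mean / stddev / count of value3 on the driver
val stats = new SummaryStatistics()
df.select("value3").collect().foreach(row => stats.addValue(row.getInt(0).toDouble))

// half-width of the 95% confidence interval around the mean
val ci   = ConfidenceIntervalApp.calcMeanCI(stats, 0.95)
val mean = stats.getMean

// every row gets the same interval, matching the expected output layout
df.withColumn("max_int", lit(mean + ci))
  .withColumn("min_int", lit(mean - ci))
  .withColumn("int", lit(ci))
  .show(false)

Whether the int column should hold the half-width, as here, or the full [min, max] interval as a string is up to you; lit simply repeats the same driver-side value on every row.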
