Weighted mean, median, and quartiles in Spark

I have a Spark SQL DataFrame:

id	Value	Weights
1	2	4
1	5	2
2	1	4
2	6	2
2	9	4
3	2	4

Before computing the statistics, apply a small transformation to the Value column:

F.explode(F.array_repeat('Value', F.col('Weights').cast('int')))

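array_repeat builds an array from each number: the number is repeated as many times as the Weights column specifies. The cast to int is needed because array_repeat expects the count to be of type int. After this step, the first value, 2, becomes [2,2,2,2].
explode then creates one row for each element of the array, so the row holding [2,2,2,2] turns into 4 rows, each containing the integer 2.
After that you can compute the statistics as usual. The results reflect the weights, because each value now appears in the DataFrame as many times as its weight.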
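To make the effect of the transformation concrete, here is a minimal plain-Python sketch (no Spark involved) for the two rows with id 1. The names rows, expanded, expanded_mean, and formula_mean are illustrative only. It repeats each value by its weight, as array_repeat followed by explode would, and checks that the plain mean of the expanded values agrees with the usual weighted-mean formula.

```python
from statistics import mean

# Rows for id == 1 from the question, as (value, weight) pairs.
rows = [(2, 4), (5, 2)]

# Plain-Python analogue of array_repeat + explode:
# each value is repeated `weight` times and the result is flattened.
expanded = [value for value, weight in rows for _ in range(int(weight))]

# Ordinary mean over the expanded values ...
expanded_mean = mean(expanded)

# ... compared with the weighted-mean formula sum(v * w) / sum(w).
formula_mean = sum(v * w for v, w in rows) / sum(w for _, w in rows)

print(expanded)
print(expanded_mean, formula_mean)
```

Because every value is physically duplicated, this approach only works as-is for non-negative integer weights, and the number of rows grows with the sum of the weights.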

Full example:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 2, 4),
     (1, 5, 2),
     (2, 1, 4),
     (2, 6, 2),
     (2, 9, 4),
     (3, 2, 4)],
    ['id', 'Value', 'Weights']
)

# Repeat each Value by its weight, then explode to one row per repetition.
# The exploded column gets the default name 'col'.
df = df.select('id', F.explode(F.array_repeat('Value', F.col('Weights').cast('int'))))

# Ordinary aggregates over the expanded rows now act as weighted statistics.
df = (df
    .groupBy('id')
    .agg(F.mean('col').alias('weighted_mean'),
         F.expr('percentile(col, 0.5)').alias('weighted_median'),
         F.expr('percentile(col, 0.25)').alias('weighted_lower_quartile'),
         F.expr('percentile(col, 0.75)').alias('weighted_upper_quartile')))
df.show()
#+---+-------------+---------------+-----------------------+-----------------------+
#| id|weighted_mean|weighted_median|weighted_lower_quartile|weighted_upper_quartile|
#+---+-------------+---------------+-----------------------+-----------------------+
#|  1|          3.0|            2.0|                    2.0|                   4.25|
#|  2|          5.2|            6.0|                    1.0|                    9.0|
#|  3|          2.0|            2.0|                    2.0|                    2.0|
#+---+-------------+---------------+-----------------------+-----------------------+
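The numbers in this table can be cross-checked without Spark. The sketch below assumes that percentile interpolates linearly between neighbouring sorted values, which is consistent with the 4.25 reported for id 1. The helper percentile_linear and the data and results dictionaries are hypothetical names introduced only for this check.

```python
def percentile_linear(values, p):
    """Percentile with linear interpolation between neighbouring sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * p          # fractional index into the sorted list
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)   # stay in range at the top end
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# (value, weight) pairs per id, copied from the example DataFrame.
data = {
    1: [(2, 4), (5, 2)],
    2: [(1, 4), (6, 2), (9, 4)],
    3: [(2, 4)],
}

results = {}
for key, rows in data.items():
    # Same expansion as array_repeat + explode.
    expanded = [v for v, w in rows for _ in range(w)]
    results[key] = (
        sum(expanded) / len(expanded),        # weighted mean
        percentile_linear(expanded, 0.5),     # weighted median
        percentile_linear(expanded, 0.25),    # weighted lower quartile
        percentile_linear(expanded, 0.75),    # weighted upper quartile
    )

for key, stats in results.items():
    print(key, stats)
```

For id 1 the expanded values are [2, 2, 2, 2, 5, 5], so the upper quartile falls three quarters of the way between 2 and 5, which gives 4.25.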
