I'm new to the Spark environment. I have a dataset with the following columns:
user_id, Date_time, order_quantity
I want to calculate the 90th percentile of order_quantity for each user_id.
If this were SQL, I would use the following query:
%sql
SELECT user_id, PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY order_quantity) OVER (PARTITION BY user_id) AS perc_90
FROM my_table  -- my_table stands in for the dataset described above
However, Spark has no built-in support for the PERCENTILE_CONT function.
Any suggestions on how to achieve this for the above dataset in Spark? Please let me know if more information is needed.
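(As a side note, depending on the Spark version, the built-in approximate percentile aggregate may already be close enough; a minimal sketch, with my_table again standing in for the dataset above:)
%sql
-- Sketch only: percentile_approx is Spark SQL's approximate percentile aggregate,
-- so the result may differ slightly from an exact PERCENTILE_CONT
SELECT
  user_id,
  percentile_approx(order_quantity, 0.9) AS approx_perc_90
FROM my_table
GROUP BY user_id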
I have a solution for PERCENTILE_DISC(0.9), which returns the discrete order_quantity closest to the 0.9 percentile (no interpolation). The idea is to compute PERCENT_RANK, subtract 0.9, take the absolute value, and then pick the minimum:
%sql
WITH temp1 AS (
  SELECT
    user_id,
    order_quantity,
    ABS(PERCENT_RANK() OVER
      (PARTITION BY user_id ORDER BY order_quantity) - 0.9) AS perc_90_temp
  FROM my_table
)
SELECT DISTINCT
  user_id,
  FIRST_VALUE(order_quantity) OVER
    (PARTITION BY user_id ORDER BY perc_90_temp) AS perc_disc_90
FROM
  temp1;
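A quick way to sanity-check the idea is to run it against a tiny inline dataset (the data below is purely illustrative, not from the original question):
%sql
-- Hypothetical sample data for a sanity check of the query above
CREATE OR REPLACE TEMPORARY VIEW my_table AS
SELECT * FROM VALUES
  (1, 10), (1, 20), (1, 30), (1, 40), (1, 50),
  (2, 5),  (2, 15), (2, 25)
AS t(user_id, order_quantity);
-- For user_id = 1 the PERCENT_RANK values are 0, 0.25, 0.5, 0.75 and 1.0,
-- so |rank - 0.9| is smallest at order_quantity = 50;
-- for user_id = 2 (ranks 0, 0.5, 1.0) the closest value is 25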
I was dealing with a similar problem. I work in SAP HANA and moved to Spark SQL on Databricks. I migrated the following SAP HANA query:
SELECT
DISTINCT ITEM_ID,
LOCATION_ID,
PERCENTILE_CONT(0.8) WITHIN GROUP (ORDER BY VENTAS) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS P95Y,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY PRECIO) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS MEDIAN_PRECIO
FROM MY_TABLE
to this equivalent in Spark SQL, using the built-in PERCENTILE aggregate as a window function:
SELECT DISTINCT
ITEM_ID,
LOCATION_ID,
PERCENTILE(VENTAS,0.8) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS P95Y,
PERCENTILE(PRECIO,0.5) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS MEDIAN_PRECIO
FROM
delta.`MY_TABLE`
In your particular case, it should look like this:
SELECT DISTINCT user_id, PERCENTILE(order_quantity, 0.9) OVER (PARTITION BY user_id) AS perc_90
FROM my_table  -- placeholder for your table
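Since PERCENTILE is an ordinary aggregate in Spark SQL, the same per-user figure can also be written with a plain GROUP BY instead of a window plus DISTINCT; a minimal sketch under the same placeholder table name:
-- Sketch: per-user 90th percentile via GROUP BY (my_table is a placeholder)
SELECT
  user_id,
  PERCENTILE(order_quantity, 0.9) AS perc_90
FROM my_table
GROUP BY user_id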