Equivalent of the percentile_cont function in Apache Spark SQL

I'm new to the Spark environment. I have a dataset with the following columns:


user_id, Date_time, order_quantity

I want to calculate the 90th percentile of order_quantity for each user_id.

If this were plain SQL, I would use the following query:

%sql
SELECT user_id,
       PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY order_quantity)
         OVER (PARTITION BY user_id) AS perc_cont_90
FROM my_table  -- assumed table name

However, Spark has no built-in support for the percentile_cont function.

Any suggestions on how to do this on the dataset above in Spark? Please let me know if more information is needed.
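
If an approximate value is good enough, a minimal alternative sketch can use Spark's built-in percentile_approx aggregate; my_table stands in for the real table name here:

%sql
-- Approximate 90th percentile of order_quantity per user_id;
-- my_table is an assumed table name.
SELECT user_id,
       percentile_approx(order_quantity, 0.9) AS approx_perc_90
FROM my_table
GROUP BY user_id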

I have a solution for PERCENTILE_DISC(0.9), which returns the discrete order_quantity closest to percentile 0.9 (no interpolation). The idea is to compute PERCENT_RANK, subtract 0.9, take the absolute value, and then take the minimum:

%sql
WITH temp1 AS (
  SELECT user_id,
         order_quantity,
         ABS(PERCENT_RANK() OVER (PARTITION BY user_id ORDER BY order_quantity) - 0.9) AS perc_90_temp
  FROM my_table  -- assumed table name
)
SELECT DISTINCT
       user_id,
       FIRST_VALUE(order_quantity) OVER (PARTITION BY user_id ORDER BY perc_90_temp) AS perc_disc_90
FROM temp1;
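
To try this out, a hypothetical in-memory view can stand in for the real table; the view name and values below are invented for illustration only:

%sql
-- Hypothetical sample data for a quick sanity check.
CREATE OR REPLACE TEMP VIEW my_table AS
SELECT * FROM VALUES
  (1, 10), (1, 20), (1, 30), (1, 40), (1, 50),
  (2,  5), (2, 15), (2, 25)
AS t(user_id, order_quantity);

Against this view, the query returns 50 for user_id 1 and 25 for user_id 2, i.e. the discrete order_quantity whose PERCENT_RANK is closest to 0.9 in each partition.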

I was working on a similar problem. I had been working in SAP HANA and then moved to Spark SQL on Databricks. I migrated the following SAP HANA query:

SELECT 
    DISTINCT ITEM_ID, 
    LOCATION_ID, 
    PERCENTILE_CONT(0.8) WITHIN GROUP (ORDER BY VENTAS) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS P95Y, 
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY PRECIO) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS MEDIAN_PRECIO 
FROM MY_TABLE

to the following Spark SQL query, using the built-in PERCENTILE function:

SELECT DISTINCT
  ITEM_ID,
  LOCATION_ID,
  PERCENTILE(VENTAS,0.8) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS P95Y,
  PERCENTILE(PRECIO,0.5) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS MEDIAN_PRECIO
FROM
    delta.`MY_TABLE`

In your particular case, it should look like this:

SELECT DISTINCT
  user_id,
  PERCENTILE(order_quantity, 0.9) OVER (PARTITION BY user_id) AS perc_90
FROM my_table  -- assumed table name
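
Since PERCENTILE is an ordinary aggregate in Spark SQL, the window/DISTINCT combination can also be sketched as a plain GROUP BY; my_table is again an assumed table name:

SELECT user_id,
       PERCENTILE(order_quantity, 0.9) AS perc_90
FROM my_table
GROUP BY user_id

Unlike percentile_approx, PERCENTILE computes the exact interpolated value (matching PERCENTILE_CONT semantics), at the cost of holding each group's values in memory.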
