Presto相当于Redshift的PERCENTILE_DISC



在RedShift中给出以下查询:

select 
distinct cast(joinstart_ev_timestamp as date) as session_date, 
PERCENTILE_DISC(0.02) WITHIN GROUP (ORDER BY join_time) over(partition by 
trunc(joinstart_ev_timestamp))/1000 as mini,
median(join_time) over(partition by trunc(joinstart_ev_timestamp))/1000 as jt,
product_name as product,
endpoint as endpoint
from qe_datawarehouse.join_session_fact
where  
cast(joinstart_ev_timestamp as date)  between date '2018-01-18' and date '2018-01-30'
and lower(product_name) LIKE 'gotoTest%' 
and join_time > 0 and join_time <= 600000 and join_time is not null 
and audio_connect_time >= 0 
and (entrypoint_access_time >= 0 or entrypoint_access_time is null)
and (panel_connect_time >= 0  or panel_connect_time is null) and version = 'V2'

我需要将上述查询转换为相应的Presto语法。我写的相应的presto查询是:

select 
distinct cast(joinstart_ev_timestamp as date) as session_date, 
PERCENTILE_DISC( WITHIN GROUP (ORDER BY cast(join_time as double)) 
over(partition by cast(joinstart_ev_timestamp as date) )/1000 as mini,
approx_percentile(cast(join_time as double),0.50) over (partition by 
cast(joinstart_ev_timestamp as date)) /1000 as jt,
product_name as product,
endpoint as endpoint
from datawarehouse.join_session_fact
where  
cast(joinstart_ev_timestamp as date)  between date '2018-01-18' and date '2018-01-30'
and lower(product_name) LIKE 'gotoTest%' 
and join_time > 0 and join_time <= 600000 and join_time is not null 
and audio_connect_time >= 0 
and (entrypoint_access_time >= 0 or entrypoint_access_time is null)
and (panel_connect_time >= 0  or panel_connect_time is null) and version = 'V2'

在这里,一切都很好,但是在行中显示错误:

PERCENTILE_DISC( WITHIN GROUP (ORDER BY cast(join_time as double)) 
    over(partition by cast(joinstart_ev_timestamp as date) )/1000 as mini,

其相应的Presto语法是什么?

如果Presto支持的嵌套窗口函数,则可以将NTH_VALUE与P*Count(*(一起使用(FAIR ...(来查找对应于" P'th"的偏移量窗口中的百分位数。由于Presto不支持这一点,因此您需要加入一个子查询,以计算窗口中的记录数:

SELECT
  my_table.window_column,
  /* Replace :p with the desired percentile (in your case, 0.02) */
  NTH_VALUE(:p*subquery.records_in_window, my_table.ordered_column)
    OVER (PARTITION BY my_table.window_column ORDER BY my_table.ordered_column BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM my_table
JOIN (
  SELECT
    window_column,
    COUNT(*) AS records_in_window
  FROM my_table
  GROUP BY window_column
) subquery ON subquery.window_column = my_table.window_column

上面的概念上是接近的,但由于:p*subquery.records_in_window是浮动而失败,因此偏移需要是整数。您有一些选择如何解决此问题。例如,如果您找到中位数,则简单地将其四舍五入到最近的整数作品。如果您发现第二个百分位数,则舍入将无法正常工作,因为它通常会给您0,并且偏移量从1开始。在这种情况下,将天花板舍入到最近的整数可能会更好。

我正在对普雷斯托的中位数进行一些研究,并找到了一种对我有用的解决方案:

例如,我有一个加入表A_join_b,其中列A_ID和B_ID。

我想找到与单个b

相关的数量的中值

选择appprox_percentile(计数,0.5(从 (( 选择计数(*(为count,nardative_id 来自A_JOIN_B 组B_ID (如计数;

相关内容

  • 没有找到相关文章

最新更新