I'm using the PERCENT_RANK() function to get percentile metrics for a given dataset. Here is the query:
WITH time_values AS (
SELECT
var,
(end_time - start_time) * 1.0 / 3600000000 AS num_hours,
PERCENT_RANK() OVER (PARTITION BY var ORDER BY num_hours) AS pct_rank
FROM table
WHERE
start_time >= 1493596800000000
AND end_time < 1493683200000000
)
SELECT
var,
pct_rank,
num_hours
FROM time_values
WHERE pct_rank IN (0.25, 0.5, 0.8, 0.99)
ORDER BY 1, 2;
However, because of how PERCENT_RANK() works, I don't get an exact match for every percentile I care about, so the output ends up looking like this:
 var | pct_rank |    num_hours
-----+----------+------------------
 a   |     0.25 |  31.752826672222
 a   |      0.5 | 171.844016125555
 b   |     0.25 | 230.704589953055
 b   |      0.5 | 246.269648327222
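I'm looking for a way to return a value for every percentile I care about: the exact match if there is one, otherwise the value closest to that percentile. Is that possible?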
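You can rank the records and then, for each percentile, take the maximum value among the rows whose rank falls before the percentile cutoff: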
WITH time_values AS (
SELECT
var,
(end_time - start_time) * 1.0 / 3600000000 AS num_hours,
-- position of each row within its var group, ordered by duration
row_number() OVER (PARTITION BY var ORDER BY num_hours) AS rank,
-- total number of rows in the same var group
count(1) OVER (PARTITION BY var) AS records
FROM table
WHERE
start_time >= 1493596800000000
AND end_time < 1493683200000000
)
SELECT
var,
-- rank/records is the fraction of the group at or below this row
max(case when 1.0*rank/records<0.25 then num_hours end) as percentile_25,
max(case when 1.0*rank/records<0.50 then num_hours end) as percentile_50,
max(case when 1.0*rank/records<0.80 then num_hours end) as percentile_80,
max(case when 1.0*rank/records<0.99 then num_hours end) as percentile_99
FROM time_values
GROUP BY var
ORDER BY 1;
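Alternatively, do the same thing with the PERCENT_RANK() output. If you really want the output as rows rather than columns, just union the results of the last step to get the structure you need.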
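As a rough sketch of that row-wise variant, the query below applies the same "largest value up to the cutoff" idea to the PERCENT_RANK() output and stacks one SELECT per percentile with UNION ALL. The `toy_values` CTE and its numbers are made up purely so the example runs on its own; in practice it would be replaced by the `time_values` CTE from the question. The comparison uses `<=` rather than `<`, which is my assumption about what you want: it keeps an exact match when one exists and otherwise falls back to the closest value below the cutoff.

```sql
-- Hypothetical toy data standing in for the real table (durations already in hours).
WITH toy_values (var, num_hours) AS (
    VALUES ('a', 1.0), ('a', 2.0), ('a', 4.0), ('a', 8.0), ('a', 16.0), ('a', 32.0),
           ('b', 3.0), ('b', 6.0), ('b', 9.0)
),
ranked AS (
    SELECT
        var,
        num_hours,
        -- PERCENT_RANK() is (rank - 1) / (rows in partition - 1), so it only
        -- takes a few discrete values per group and rarely hits 0.8 or 0.99.
        PERCENT_RANK() OVER (PARTITION BY var ORDER BY num_hours) AS pct_rank
    FROM toy_values
)
-- One branch per percentile; each keeps the largest value whose pct_rank
-- does not exceed the cutoff. Because pct_rank grows with num_hours,
-- MAX(num_hours) picks the row closest to the cutoff from below.
SELECT var, 0.25 AS percentile, MAX(num_hours) AS num_hours
FROM ranked WHERE pct_rank <= 0.25 GROUP BY var
UNION ALL
SELECT var, 0.50, MAX(num_hours)
FROM ranked WHERE pct_rank <= 0.50 GROUP BY var
UNION ALL
SELECT var, 0.80, MAX(num_hours)
FROM ranked WHERE pct_rank <= 0.80 GROUP BY var
UNION ALL
SELECT var, 0.99, MAX(num_hours)
FROM ranked WHERE pct_rank <= 0.99 GROUP BY var
ORDER BY 1, 2;
```

Since the first row of every partition has a PERCENT_RANK() of 0, each branch returns one row for every non-empty `var` group, so all four percentiles show up for every group even when no row matches a cutoff exactly.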