如何根据百分位数过滤表格，然后在 HQL 中随机抽样?

我正在尝试从表中随机抽取 200 行，但首先我想过滤它以仅从变量中选择前 1% 的值。

我收到以下错误 -

编译语句时出错：失败：解析异常行 3：31 无法识别"选择"percentile_approx"("附近的输入表达式规范

以下是我的查询-

> with sample_pop as (select * from
> mytable a where
> a.transaction_amount > (select
> percentile_approx(transaction_amount, 0.99) as top1
>                             from mytable) )
> 
> select * from sample_pop  distribute by rand(1) sort by rand(1) limit
> 200;

我不认为 Hive 支持标量子查询，就像你使用它们的方式一样(仅适用于IN/EXISTS(。因此，将逻辑移至FROM子句：

with sample_pop as (
select *
from mytable a cross join
(select percentile_approx(transaction_amount, 0.99) as top1
from mytable
) aa
where a.transaction_amount > aa.top1
)
select * 
from sample_pop distribute by rand(1) 
order by rand(1)
limit 200;

通过以下查询解决了我的问题 -

with sample_pop as (select a.* from 
(
select *, cum_dist() over (order by transaction_amount asc) pct
from mytable
) a
where pct >= 0.99
)
select * 
from sample_pop distribute by rand(1) 
order by rand(1)
limit 200;

相关内容

最新更新

热门标签：