Presto-按行获取模式值



我想在定义的列上为行]获取最重复的值。

例如,我有一张表:

|id |x    |y    |z    |
|---|-----|-----|-----|
|a  |green|green|black|
|b  |red  |green|red  |
|c  |red  |black|black|
|d  |red  |green|black|

因此输出为:

|id |x    |y    |z    |mode   |
|---|-----|-----|-----|-------|
|a  |green|green|black|green  |
|b  |red  |green|red  |red    |
|c  |red  |black|black|black  |
|d  |red  |green|black|unknown|

在平局的情况下(对于id"d"(,模式将是默认值"未知"。

以下是我解决问题的方法:

with dt (id, x, y, z) as (
values
('a', 'green', 'green', 'black'),
('b', 'red', 'green', 'red'),
('c', 'red', 'black', 'black'), 
('d', 'red', 'green', 'black')
), dt_map as (
select 
*,
transform_values(
multimap_from_entries(
transform(array[x, y, z], x -> row(x, 1))
),
(k, v) -> reduce(v, 0, (s, x) -> s + x, s -> s)
) as m
from dt
), dt_map_filter as (
select 
id, 
x, 
y, 
z,
map_keys(
map_filter(
m,
(k, v) -> v = array_max(map_values(m)) 
)
) as m
from dt_map
)
select 
id, 
x, 
y, 
z,
if(cardinality(m) > 1, 'unknown', element_at(m, 1)) as mode
from dt_map_filter;

这是有效的,但我想知道普雷斯托是否有更好的解决方案。

感谢

另一种方法-创建包含有问题值的数组列,用unnest将其展平,分组并使用histogram确定计数,然后提取最常见的值:

-- sample data
WITH dataset(id, x, y, z ) AS (
VALUES ('a', 'green', 'green', 'black'),
('b', 'red',   'green', 'red'),
('c', 'red',   'black', 'black'),
('d', 'red',   'green', 'black')
)
-- query
select id,
x,
y,
z,
if(cardinality(maxXYZ) = 1, maxXYZ [ 1 ], 'unknown') mode -- if multiple values have same frequency - then unknown  
from (
select id,
x,
y,
z,
map_keys(
map_filter(
mp,
(k, v)->v = array_max(map_values(mp))
)
) AS maxXYZ
from (
select max(id) id, -- all values are the same so `max` or `min` should work just fine
max(x) x,
max(y) y,
max(z) z,
histogram(v) mp -- map containing counts
from (
select *,
row_number() over() rn -- unique row id to group by later
from dataset
) d,
unnest (array [ x, y, z ]) as t(v)
group by rn
)
)

输出:

y绿色黑色红色绿色黑色未知
idxz模式
a绿色黑色
红色黑色
红色绿色红色
红色

最新更新