我想在定义的列上为行]获取最重复的值。
例如,我有一张表:
|id |x |y |z |
|---|-----|-----|-----|
|a |green|green|black|
|b |red |green|red |
|c |red |black|black|
|d |red |green|black|
因此输出为:
|id |x |y |z |mode |
|---|-----|-----|-----|-------|
|a |green|green|black|green |
|b |red |green|red |red |
|c |red |black|black|black |
|d |red |green|black|unknown|
在平局的情况下(对于id"d"(,模式将是默认值"未知"。
以下是我解决问题的方法:
with dt (id, x, y, z) as (
values
('a', 'green', 'green', 'black'),
('b', 'red', 'green', 'red'),
('c', 'red', 'black', 'black'),
('d', 'red', 'green', 'black')
), dt_map as (
select
*,
transform_values(
multimap_from_entries(
transform(array[x, y, z], x -> row(x, 1))
),
(k, v) -> reduce(v, 0, (s, x) -> s + x, s -> s)
) as m
from dt
), dt_map_filter as (
select
id,
x,
y,
z,
map_keys(
map_filter(
m,
(k, v) -> v = array_max(map_values(m))
)
) as m
from dt_map
)
select
id,
x,
y,
z,
if(cardinality(m) > 1, 'unknown', element_at(m, 1)) as mode
from dt_map_filter;
这是有效的,但我想知道普雷斯托是否有更好的解决方案。
感谢
另一种方法-创建包含有问题值的数组列,用unnest
将其展平,分组并使用histogram
确定计数,然后提取最常见的值:
-- sample data
WITH dataset(id, x, y, z ) AS (
VALUES ('a', 'green', 'green', 'black'),
('b', 'red', 'green', 'red'),
('c', 'red', 'black', 'black'),
('d', 'red', 'green', 'black')
)
-- query
select id,
x,
y,
z,
if(cardinality(maxXYZ) = 1, maxXYZ [ 1 ], 'unknown') mode -- if multiple values have same frequency - then unknown
from (
select id,
x,
y,
z,
map_keys(
map_filter(
mp,
(k, v)->v = array_max(map_values(mp))
)
) AS maxXYZ
from (
select max(id) id, -- all values are the same so `max` or `min` should work just fine
max(x) x,
max(y) y,
max(z) z,
histogram(v) mp -- map containing counts
from (
select *,
row_number() over() rn -- unique row id to group by later
from dataset
) d,
unnest (array [ x, y, z ]) as t(v)
group by rn
)
)
输出:
id | x | yz | 模式 |
---|---|---|---|
a | 绿色 | 黑色 | 绿色|
红色 | 黑色 | 黑色||
红色 | 绿色 | 红色红色 | |
红色 | 绿色黑色未知