如何对数组中的值进行排序,聚合,有效地计算SQL中相似记录的方法



我需要对表中的类型属性进行计数和聚合。

这些是"Title">的一些例子列:

<表类> 行标题 1 WATLINGTON STATION - 9住宅及工业单位 2 STATION ROAD YATE - 1平房&2房屋 3 THE OLD VICARAGE CARR LANE - HOUSE & &;稳定块

显示重新排序和聚合的示例想法是首先将数组平铺,然后在聚合时重新排序。最后,一旦它们按顺序排列好,就可以使用简单分组,这将把以前像['a','b'], ['b','a']这样的所有数组分组到一个组中。

with cte (category, frequency) as (
select array_construct('HOUSES'),12432 union all
select array_construct('FLATS'),76736 union all
select array_construct('HOUSE'),23865 union all
select array_construct('APARTMENTS'),18814 union all
select array_construct('HOUSES', 'FLATS'),3245 union all
select array_construct('FLATS', 'HOUSES'),1342 union all
select array_construct('FLATS', 'RETAIL UNIT'),362 union all
select array_construct('FLATS', 'HOUSE'),461 union all
select array_construct('FLATS', 'RETAIL'),890 union all
select array_construct('FLATS', 'RETAIL UNITS'),287
), cte_1 as
(select seq,value,c1.frequency
from cte c1,
lateral flatten(input=>category)
order by seq, value
), cte_2 as
(select array_agg(value) category, frequency
from cte_1
group by seq,frequency)
select category, sum(frequency)
from cte_2
group by category;
tbody> <<tr>
CATEGORYSUM(FREQUENCY)
["FLATS","HOUSE")461
[" flat ";RETAIL"])890
["HOUSE")23865
["FLATS")76736
[" flat ";零售单位";])287
["APARTMENTS")18814
[" flat ";零售单位"])362
["HOUSES")12432
["FLATS","HOUSES")4587

我真的很喜欢Panka的答案,但是ORDER BY不应该在CTE的步骤中,应该在array_agg中,因为a)有一个命令来解决这个问题b) CTE的顺序不重要,依赖于行顺序是一个会烧你的魔法。

CTE2和CTE也可以合并成一个步骤,如果你别名FLATEN,你可以看到发生了什么:

with fake_data_table (category, frequency) as (
select split(column1, '|'), column2 
from values 
('HOUSES', 12432),
('FLATS', 76736),
('HOUSE', 23865),
('APARTMENTS', 18814),
('HOUSES|FLATS', 3245),
('FLATS|HOUSES', 1342),
('FLATS|RETAIL UNIT', 362),
('FLATS|HOUSE', 461),
('FLATS|RETAIL', 890),
('FLATS|RETAIL UNITS', 287)
), part_a as (
select array_agg(f.value) within group (order by f.seq, f.value) as category, 
d.frequency
from fake_data_table as d,
lateral flatten(input=>category) f
group by f.seq, d.frequency
)
select category, sum(frequency)
from part_a
group by category 
order by 1;

给:

tbody> <<tr>
CATEGORYSUM(FREQUENCY)
["APARTMENTS")18814
["FLATS")76736
[" flat ", "HOUSE"])461
["FLATS";;)4587
[" flat ";RETAIL"])890
[" flat ";零售单位"])362
[" flat ";零售单位";])287
["HOUSE")23865
["HOUSES")12432

最新更新