我有一个这样的表:
id | arr_val | grp
-----------------
1 | {10,20} | -
2 | {20,30} | -
3 | {50,5} | -
4 | {30,60} | -
5 | {1,5} | -
6 | {7,6} | -
我想找出一组中的哪些行在一起。在这个例子中,1,2,4是一个组,因为1和2有一个共同的元素,2和4。3和5形成一个组,因为它们具有共同的元件。6与任何人都没有共同之处。所以它自己组成了一个团体。结果应该是这样的:
id | arr_val | grp
-----------------
1 | {10,20} | 1
2 | {20,30} | 1
3 | {50,5} | 2
4 | {30,60} | 1
5 | {1,5} | 2
6 | {7,6} | 3
我想我需要递归cte,因为我的问题是图形化的,但我不确定如何做到这一点。
其他信息和背景:
该表约有2500000行。
事实上,我试图解决的问题有更多的领域和条件来寻找一个群体:
id | arr_val | date | val | grp
---------------------------------
1 | {10,20} | -
2 | {20,30} | -
组中的元素不仅需要通过arr_val中的公共元素进行链接。它们都需要在val中具有相同的值,并且需要通过日期中的时间跨度(间隔和孤岛(进行链接。我解决了另外两个问题,但现在我的问题的条件增加了。如果有一种简单的方法可以在一个查询中同时完成这三个操作,那将是非常棒的,但这并不是必须的。
----编辑-----
虽然这两个答案都适用于五行的例子,但它们不适用于行数更多的表。两个答案都存在递归部分的行数爆炸的问题,并且只在最后减少行数。解决方案也应该适用于这样的数据:
id | arr_val | grp
-----------------
1 | {1} | -
2 | {1} | -
3 | {1} | -
4 | {1} | -
5 | {1} | -
6 | {1} | -
7 | {1} | -
8 | {1} | -
9 | {1} | -
10 | {1} | -
11 | {1} | -
more rows........
这个问题有解决办法吗?
您可以将其作为递归CTE处理。基于公共值定义ID之间的边。然后遍历边缘并聚合:
with recursive nodes as (
select id, val
from t cross join
unnest(arr_val) as val
),
edges as (
select distinct n1.id as id1, n2.id as id2
from nodes n1 join
nodes n2
on n1.val = n2.val
),
cte as (
select id1, id2, array[id1] as visited, 1 as lev
from edges
where id1 = id2
union all
select cte.id1, e.id2, visited || e.id2,
lev + 1
from cte join
edges e
on cte.id2 = e.id1
where e.id2 <> all(cte.visited)
),
vals as (
select id1, array_agg(distinct id2 order by id2) as id2s
from cte
group by id1
)
select *, dense_rank() over (order by id2s) as grp
from vals;
这里有一个db<gt;不停摆弄
这里有一种解决图行走问题的方法:
with recursive cte as (
select id, arr_val, array[id] path from mytable
union all
select t.id, t.arr_val, c.path || t.id
from cte c
inner join mytable t on t.arr_val && c.arr_val and not t.id = any(c.path)
)
select c.id, c.arr_val, dense_rank() over(order by min(x.id)) grp
from cte c
cross join lateral unnest(c.path) as x(id)
group by c.id, c.arr_val
order by c.id
公共表表达式遍历图,递归地查找";相邻的";节点到当前节点,同时跟踪已访问的节点。然后,外部查询聚合,使用每条路径的最小节点来识别组,并最终对组进行排名。
DB Fiddle上的演示:
id|arr_val|grp-:|:------|--:1|{10,20}|12|{20,30}|13|{50.5}|24|{30,60}|15|{1,5}|26|{7,6}|3
虽然Gordon Linoffs解决方案是我发现的最快的解决方案,适用于组不那么大的少量数据,但它不适用于更大的数据集和更大的组。我改变了他的解决方案,使之发挥作用。我将边移动到索引表中:创建表格边缘(id1整数不为空,id2整数不为空,约束staffel_group_nodes_pk主键(id1,id2();
insert into edges(id1, id2) with
nodes(id, arr_val) as (
select id, arr_val
from my_table
)
select n1.id as id1, n2.id as id2
from nodes n1
join
nodes n2
on n1.arr_val && n2.arr_val ;
Tha一个人也没帮上忙。我也更改了他的递归部分:
with recursive
cte as (
select id1, array [id1] as visited
from edges
where id1 = id2
union all
select unnested.id1, array_agg(distinct unnested.vis) as visited
from (
select cte.id1,
unnest(cte.visited || e.id2) as vis
from cte
join
staffel_group_edges e
on e.id1 = any (cte.visited)
and e.id2 <> all (cte.visited)) as unnested
group by unnested.id1
),
vals as (
select id1, array_agg(distinct vis) as id2s
from (
select cte.id1,
unnest(cte.visited) as vis
from cte) as unnested
group by unnested.id1
)
select id1,id2s, dense_rank() over (order by id2s) as grp
from vals;
每走一步,我都会按起点对所有搜索进行分组。这大大减少了平行步行路径的数量,而且速度惊人。