postgreSQL对交错/重叠的行进行分组



我在Postgres中有下面的表,它在a_snob_sno两列中有重叠的数据。

create table data
( a_sno integer not null,  
b_sno integer not null,
PRIMARY KEY (a_sno,b_sno)
);
insert into data (a_sno,b_sno) values
( 4, 5 )
, ( 5, 4 )
, ( 5, 6 )
, ( 6, 5 )
, ( 6, 7 )
, ( 7, 6 )
, ( 9, 10)
, ( 9, 13)
, (10, 9 )
, (13, 9 )
, (10, 13)
, (13, 10)
, (10, 14)
, (14, 10)
, (13, 14)
, (14, 13)
, (11, 15)
, (15, 11);

从前6行可以看出,两列中的数据值4、5、6和7相交/重叠,需要划分为一个组。行7-16和行17-18也是如此,它们将分别标记为组2和组3。

结果输出应该是这样的:

group | value
------+------
1     | 4
1     | 5
1     | 6
1     | 7
2     | 9
2     | 10
2     | 13
2     | 14
3     | 11
3     | 15

假设所有对都存在于它们的镜像组合中以及(4,5)(5,4)中。但以下解决方案在没有镜像重复的情况下也能正常工作。

简单案例

所有连接都可以按单个升序排列,不可能出现像我在fiddle中添加的复杂情况,我们可以在rCTE:中使用此解决方案而不重复

我首先得到每个组的最小a_sno,以及最小相关的b_sno:

SELECT row_number() OVER (ORDER BY a_sno) AS grp
, a_sno, min(b_sno) AS b_sno
FROM   data d
WHERE  a_sno < b_sno
AND    NOT EXISTS (
SELECT 1 FROM data
WHERE  b_sno = d.a_sno
AND    a_sno < b_sno
)
GROUP  BY a_sno;

这只需要一个单一的查询级别,因为窗口函数可以在聚合上构建:

  • 获取联接表列的不同和

结果:

grp  a_sno  b_sno
1    4      5
2    9      10
3    11     15

我避免分支和重复(相乘)行-长链可能会使的成本高得多。我在相关子查询中使用ORDER BY b_sno LIMIT 1,使其在递归CTE中运行。

  • 在非唯一列上创建唯一索引

性能的关键是匹配索引,它已经由PK约束PRIMARY KEY (a_sno,b_sno)提供:而不是反过来(b_sno, a_sno):

  • 复合索引是否也适用于第一个字段的查询

WITH RECURSIVE t AS (
SELECT row_number() OVER (ORDER BY d.a_sno) AS grp
, a_sno, min(b_sno) AS b_sno  -- the smallest one
FROM   data d
WHERE  a_sno < b_sno
AND    NOT EXISTS (
SELECT 1 FROM data
WHERE  b_sno = d.a_sno
AND    a_sno < b_sno
)
GROUP  BY a_sno
)
, cte AS (
SELECT grp, b_sno AS sno FROM t
UNION ALL
SELECT c.grp
, (SELECT b_sno  -- correlated subquery
FROM   data
WHERE  a_sno = c.sno
AND    a_sno < b_sno
ORDER  BY b_sno
LIMIT  1)
FROM   cte  c
WHERE  c.sno IS NOT NULL
)
SELECT * FROM cte
WHERE  sno IS NOT NULL   -- eliminate row with NULL
UNION  ALL               -- no duplicates
SELECT grp, a_sno FROM t
ORDER  BY grp, sno;

不那么简单的案例

从根(最小的sno)开始,可以用一个或多个分支按升序到达所有节点。

这一次,获取所有更大的sno,并在结束时使用UNION消除可能多次访问的重复节点:

WITH RECURSIVE t AS (
SELECT rank() OVER (ORDER BY d.a_sno) AS grp
, a_sno, b_sno  -- get all rows for smallest a_sno
FROM   data d
WHERE  a_sno < b_sno
AND    NOT EXISTS (
SELECT 1 FROM data
WHERE  b_sno = d.a_sno
AND    a_sno < b_sno
)
)   
, cte AS (
SELECT grp, b_sno AS sno FROM t
UNION ALL
SELECT c.grp, d.b_sno
FROM   cte  c
JOIN   data d ON d.a_sno = c.sno
AND d.a_sno < d.b_sno  -- join to all connected rows
)
SELECT grp, sno FROM cte
UNION                     -- eliminate duplicates
SELECT grp, a_sno FROM t  -- add first rows
ORDER  BY grp, sno;

与第一个解决方案不同,我们在这里没有得到最后一行NULL(由相关的子查询引起)。

两者都应该表现得很好,尤其是在长链/多分支的情况下。所需结果:

SQL Fiddle(添加行以展示难度)。

无向图

如果存在无法通过升序遍历从根达到的局部极小值,则上述解决方案将不起作用。在这种情况下,请考虑FarhÉg的解决方案。

我想说另一种方法,它可能很有用,你可以分两步完成:

1.每组取max(sno)

select q.sno,
row_number() over(order by q.sno) gn
from(
select distinct d.a_sno sno
from data d
where not exists (
select b_sno
from data
where b_sno=d.a_sno
and a_sno>d.a_sno
)
)q

结果:

sno gn
7   1
14  2
15  3

2.使用recursive cte查找组中的所有相关成员:

with recursive cte(sno,gn,path,cycle)as(
select q.sno,
row_number() over(order by q.sno) gn,
array[q.sno],false
from(
select distinct d.a_sno sno
from data d
where not exists (
select b_sno
from data
where b_sno=d.a_sno
and a_sno>d.a_sno
)
)q
union all
select d.a_sno,c.gn,
d.a_sno || c.path,
d.a_sno=any(c.path)
from data d
join cte c on d.b_sno=c.sno 
where not cycle
)
select distinct gn,sno from cte
order by gn,sno

结果:

gn  sno
1   4
1   5
1   6
1   7
2   9
2   10
2   13
2   14
3   11
3   15

这是我所做的演示

这里有一个开始,可以给出一些方法的想法。递归查询从每条记录的a_sno开始,然后尝试遵循b_sno的路径,直到它到达末尾或形成一个循环。路径由sno整数的数组表示。

unnest函数将数组分解成行,因此sno值映射到路径数组,例如:

4, {6, 5, 4}

将转换为数组中每个值的一行:

4, 6
4, 5
4, 4

然后,array_agg通过将值聚合回一个路径来反转操作,但去掉重复项并排序。

现在每个CCD_ 21都与一个路径相关联,并且该路径形成分组。CCD_ 22可以用于将分组(集群)映射为数字。

SELECT array_agg(DISTINCT map ORDER BY map) AS cluster
,sno
FROM ( WITH RECURSIVE x(sno, path, cycle) AS (
SELECT a_sno, ARRAY[a_sno], false FROM  data
UNION ALL
SELECT b_sno, path || b_sno, b_sno = ANY(path)
FROM data, x
WHERE a_sno = x.sno
AND NOT cycle
)
SELECT sno, unnest(path) AS map FROM x ORDER BY 1
) y
GROUP BY sno
ORDER BY 1, 2

输出:

cluster    | sno 
--------------+-----
{4,5,6,7}    |   4
{4,5,6,7}    |   5
{4,5,6,7}    |   6
{4,5,6,7}    |   7
{9,10,13,14} |   9
{9,10,13,14} |  10
{9,10,13,14} |  13
{9,10,13,14} |  14
{11,15}      |  11
{11,15}      |  15
(10 rows)

为排名再包装一次:

SELECT dense_rank() OVER(order by cluster) AS rank
,sno
FROM (
SELECT array_agg(DISTINCT map ORDER BY map) AS cluster
,sno
FROM ( WITH RECURSIVE x(sno, path, cycle) AS (
SELECT a_sno, ARRAY[a_sno], false FROM  data
UNION ALL
SELECT b_sno, path || b_sno, b_sno = ANY(path)
FROM data, x
WHERE a_sno = x.sno
AND NOT cycle
)
SELECT sno, unnest(path) AS map FROM x ORDER BY 1
) y
GROUP BY sno
ORDER BY 1, 2
) z

输出:

rank | sno 
------+-----
1 |   4
1 |   5
1 |   6
1 |   7
2 |   9
2 |  10
2 |  13
2 |  14
3 |  11
3 |  15
(10 rows)

相关内容

  • 没有找到相关文章

最新更新