MySQL:如何将范围的分区组合成尽可能大的连续范围



我一直在尝试完成一个相当复杂的SQL查询(也许很简单?(来压缩一个包含重复信息的表。我在SequelPro中使用MySQL 5.7.14。我是一个新手SQL用户,对连接,联合等有基本的了解。我认为这个需要带有一些组bys的子查询,但我不知道如何做到最好。 下表说明了我正在尝试执行的操作的一个简单示例:

桌子

对于每个重复col_1条目,当由 col_2 和 3(分别为范围的开始和结束(设置的范围重叠时,我想压缩为单个条目。对于 col_4 和 5,应报告此范围内的条目中的最大值。在上面的例子中,在col_1中,有三个重叠的范围,我想将其压缩为 col_1 的最小值和 col_2 的最大值,最大值为 col_4 和 5。对于col_2中的"b",有两个范围(31-50、12-15(不重叠,因此它将按原样返回两行。对于 c,它将返回一行范围为 100-300,值分别为 3、2 col_4和 col_5。此示例所需的完整结果如下所示:

查询输出

我应该补充一点,在某些地方有"空"值,应该被视为零。 有没有人知道最好,最简单的方法? 提前谢谢你!!

更新:我尝试使用建议的范围设置查询,但出现错误。查询如下:

WITH a AS (SELECT range 
, lower(col_2) AS startdate
, max(upper(col_3)) OVER (ORDER BY range) AS `end`
FROM   `combine`
)
, b AS (
SELECT *, lag(`end`) OVER (ORDER BY range) < `start` OR NULL AS step
FROM   a
)
, c AS (
SELECT *, count(step) OVER (ORDER BY range) AS grp
FROM   b
)
SELECT daterange(min(`start`), max(`end`)) AS range
FROM   c
GROUP  BY grp
ORDER  BY 1;

我收到的错误是:您的 SQL 语法有错误;请查看与您的 MySQL 服务器版本对应的手册,了解在"a AS (SELECT 范围("附近使用的正确语法 ,较低的(col_2( AS 开始日期 ,最大(上(col_3(( OVE' 在第 1 行

这并非易事,但可以在单个查询中完成。

困难的部分是将一组间隔组合成尽可能大的连续间隔。这篇文章详细介绍了解决方案。

要获得您想要的结果,您现在需要:

  1. 使用链接中给出的查询计算 col1 中每个值的最大可能连续间隔。

根据您的示例值,结果将是:

col_1 lower_bound upper_bound
a     20          60
b     12          15
b     31          50
c     100         300
  1. 将这些大间隔之一与your_table中的每一行相关联。每行只能有一个这样的间隔,所以让我们INNER JOIN

    SELECT my_table.*, large_intervals.lower_bound, large_intervals.upper_bound FROM my_table INNER JOIN (my_awesome_query(your_table)) large_intervals ON large_intervals.col1 = my_table.col1 AND large_intervals.lower_bound <= my_table.col2 AND large_intervals.upper_bound >= my_table.col3

你会得到:

col1 col2 col3 col4 col5 lower_bound upper_bound
a    45   50   1    0    20          60
a    50   61   6    0    20          60
a    20   45   0    5    20          60
b    31   50   0    1    31          50
b    12   15   5    0    12          15
c    100  200  3    2    100         300
c    150  300  1    2    100         300
  1. 那么很容易,只需按 col1、lower_bound、上限分组:

SELECT col1, lower_bound AS col2, upper_bound AS col3, MAX(col4) AS col4, MAX(col5) AS col5 FROM (query above) decorated_table GROUP BY col1, lower_bound, upper_bound

你会得到你所追求的结果。

回到困难的部分:上面提到的帖子公开了PostgreSQL的解决方案。MySQL没有范围类型,但解决方案可以调整。例如,直接使用下限col2,而不是lower(range)。该解决方案还利用了窗口函数,即laglead,但 MySQL 支持这样做,语法相同,所以这里没有问题。另请注意,它们使用COALESCE(upper(range), 'infinity')来防止未绑定的范围。由于您的范围是有限的,因此您无需关心此,您可以直接使用上限范围,即col3.以下是改编:

WITH a AS (
SELECT
col2,
col3,
col2 AS lower_bound, 
MAX(col3) OVER (ORDER BY col2, col3) AS upper_bound
FROM   combine
)
, b AS (
SELECT *, lag(upper_bound) OVER (ORDER BY col2, col3) < lower_bound OR NULL AS step
FROM   a
)
, c AS (
SELECT *, count(step) OVER (ORDER BY col2, col3) AS grp
FROM   b
)
SELECT
MIN(lower_bound) AS lower_bound,
MAX(upper_bound) AS range
FROM   c
GROUP  BY grp
ORDER  BY 1;

这适用于单个组。如果你想通过col1获取范围,你可以像这样调整它:

WITH a AS (
SELECT
col1,
col2,
col3,
col2 AS lower_bound, 
MAX(col3) OVER (PARTITION BY col1 ORDER BY col2, col3) AS upper_bound
FROM   combine
)
, b AS (
SELECT *, lag(upper_bound) OVER (PARTITION BY col1 ORDER BY col2, col3) < lower_bound OR NULL AS step
FROM   a
)
, c AS (
SELECT *, count(step) OVER (PARTITION BY col1 ORDER BY col2, col3) AS grp
FROM   b
)
SELECT
MIN(lower_bound) AS lower_bound,
MAX(upper_bound) AS range
FROM   c
GROUP  BY col1, grp
ORDER  BY 1;

结合所有内容,我们得到以下内容(在您提供的示例上进行测试(完全返回您预期的输出:

WITH a AS (
SELECT
col1,
col2,
col3,
col2 AS lower_bound, 
MAX(col3) OVER (PARTITION BY col1 ORDER BY col2, col3) AS upper_bound
FROM   combine
)
, b AS (
SELECT *, lag(upper_bound) OVER (PARTITION BY col1 ORDER BY col2, col3) < lower_bound OR NULL AS step
FROM   a
)
, c AS (
SELECT *, count(step) OVER (PARTITION BY col1 ORDER BY col2, col3) AS grp
FROM   b
)
, large_intervals AS (
SELECT
col1,
MIN(lower_bound) AS lower_bound,
MAX(upper_bound) AS upper_bound
FROM   c
GROUP  BY col1, grp
ORDER  BY 1
)
, combine_with_large_interval AS (
SELECT
combine.*,
large_intervals.lower_bound,
large_intervals.upper_bound
FROM combine
INNER JOIN large_intervals
ON large_intervals.col1 = combine.col1
AND large_intervals.lower_bound <= combine.col2
AND large_intervals.upper_bound >= combine.col3
)
SELECT
col1,
lower_bound AS col2,
upper_bound AS col3, 
MAX(col4) AS col4, 
MAX(col5) AS col5
FROM combine_with_large_interval
GROUP BY col1, lower_bound, upper_bound
ORDER BY col1, col2, col3;

瞧!

最新更新