我有一个结构像这样的大表(value: integer
;created_at: timestamp with time zone
)
| val | created_at |
|:----|:--------------------|
| 1 | 2021-01-05 13:07:23 |
| 2 | 2021-01-05 13:07:29 |
| 1 | 2021-01-05 13:07:39 |
| 1 | 2021-01-05 13:07:40 |
| 1 | 2021-01-05 13:07:41 |
| 1 | 2021-01-05 13:07:43 |
| 2 | 2021-01-05 13:07:46 |
| 2 | 2021-01-05 13:07:48 |
| 2 | 2021-01-05 13:07:52 |
| 1 | 2021-01-05 13:07:54 |
| 1 | 2021-01-05 13:07:58 |
| 1 | 2021-01-05 13:07:59 |
| 2 | 2021-01-05 13:08:01 |
| 1 | 2021-01-05 13:08:04 |
我想将具有相同值的记录分组,并为每个间隔获得最低和最高的created_at
。
我还需要正确处理只有1条记录的间隔(如在表的开始和结束处)
| val | created_from | created_to |
|:----|:--------------------|:--------------------|
| 1 | 2021-01-05 13:07:23 | 2021-01-05 13:07:23 |
| 2 | 2021-01-05 13:07:29 | 2021-01-05 13:07:29 |
| 1 | 2021-01-05 13:07:39 | 2021-01-05 13:07:43 |
| 2 | 2021-01-05 13:07:46 | 2021-01-05 13:07:52 |
| 1 | 2021-01-05 13:07:54 | 2021-01-05 13:07:59 |
| 2 | 2021-01-05 13:08:01 | 2021-01-05 13:08:01 |
| 1 | 2021-01-05 13:08:04 | 2021-01-05 13:08:04 |
我如何在1个SQL查询中实现这一点?
首先,我们需要消除每个常量间隔中间的记录。我们使用lead和lag函数获得与下一个或前一个记录相比具有不同val
的所有记录
select
*
from
(select
lag(val) over (order by created_at) prev_val,
val,
lead(val) over (order by created_at) next_val,
created_at
from tbl
order by created_at) t
where (
prev_val isnull
or next_val isnull
or val != next_val
or val != prev_val
)
| prev_val | val | next_val | created_at |
|:---------|:----|:---------|:--------------------|
| NULL | 1 | 2 | 2021-01-05 13:07:23 |
| 1 | 2 | 1 | 2021-01-05 13:07:29 |
| 2 | 1 | 1 | 2021-01-05 13:07:39 |
| 1 | 1 | 2 | 2021-01-05 13:07:43 |
| 1 | 2 | 2 | 2021-01-05 13:07:46 |
| 2 | 2 | 1 | 2021-01-05 13:07:52 |
| 2 | 1 | 1 | 2021-01-05 13:07:54 |
| 1 | 1 | 2 | 2021-01-05 13:07:59 |
| 1 | 2 | 1 | 2021-01-05 13:08:01 |
| 2 | 1 | NULL | 2021-01-05 13:08:04 |
这里我们有每个常数区间的端点。我们也有单点间隔的记录。
接下来,我们需要结合每个区间的左右两端。
select
val,
(case when prev_val_2 = val then prev_created_at_2 else created_at end) as created_from,
created_at as created_to
from
(select
lag(val) over (order by created_at) prev_val_2,
val,
lead(val) over (order by created_at) next_val_2,
lag(created_at) over (order by created_at) prev_created_at_2,
created_at
from
(select
lag(val) over (order by created_at) prev_val,
val,
lead(val) over (order by created_at) next_val,
created_at
from tbl
order by created_at) t
where (
(prev_val isnull
or next_val isnull
or val != next_val
or val != prev_val)
)) tt
where val != next_val_2 or next_val_2 isnull
| val | created_from | created_to |
|:----|:--------------------|:--------------------|
| 1 | 2021-01-05 13:07:23 | 2021-01-05 13:07:23 |
| 2 | 2021-01-05 13:07:29 | 2021-01-05 13:07:29 |
| 1 | 2021-01-05 13:07:39 | 2021-01-05 13:07:43 |
| 2 | 2021-01-05 13:07:46 | 2021-01-05 13:07:52 |
| 1 | 2021-01-05 13:07:54 | 2021-01-05 13:07:59 |
| 2 | 2021-01-05 13:08:01 | 2021-01-05 13:08:01 |
| 1 | 2021-01-05 13:08:04 | 2021-01-05 13:08:04 |
(case when prev_val_2 = val then prev_created_at_2 else created_at end)
表达解释:
我们有两种情况:像这样的单点区间
| prev_val | val | next_val | created_at |
|:---------|:----|:---------|:--------------------|
| NULL | 1 | 2 | 2021-01-05 13:07:23 |
| 1 | 2 | 1 | 2021-01-05 13:07:29 | <--
和这样的多点间隔
| prev_val | val | next_val | created_at |
|:---------|:----|:---------|:--------------------|
| 2 | 1 | 1 | 2021-01-05 13:07:39 | <--
| 1 | 1 | 2 | 2021-01-05 13:07:43 | <--
条件lag(val) = val
(prev_val_2 = val
)代表第二种情况(1=1
),因此这里我们将lag(created_at)
(prev_created_at_2
)作为created_from
:2021-01-05 13:07:39
。
条件lag(val) != val
(1!=2
)表示第一种情况,因此这里我们将created_at
同时作为created_to
和created_from
:2021-01-05 13:07:29