在PostgreSQL上按相同值分组时间序列数据



我有一个结构像这样的大表(value: integer;created_at: timestamp with time zone)

| val | created_at          |
|:----|:--------------------|
| 1   | 2021-01-05 13:07:23 |
| 2   | 2021-01-05 13:07:29 |
| 1   | 2021-01-05 13:07:39 |
| 1   | 2021-01-05 13:07:40 |
| 1   | 2021-01-05 13:07:41 |
| 1   | 2021-01-05 13:07:43 |
| 2   | 2021-01-05 13:07:46 |
| 2   | 2021-01-05 13:07:48 |
| 2   | 2021-01-05 13:07:52 |
| 1   | 2021-01-05 13:07:54 |
| 1   | 2021-01-05 13:07:58 |
| 1   | 2021-01-05 13:07:59 |
| 2   | 2021-01-05 13:08:01 |
| 1   | 2021-01-05 13:08:04 |

我想将具有相同值的记录分组,并为每个间隔获得最低和最高的created_at

我还需要正确处理只有1条记录的间隔(如在表的开始和结束处)

| val | created_from        | created_to          |
|:----|:--------------------|:--------------------|
| 1   | 2021-01-05 13:07:23 | 2021-01-05 13:07:23 |
| 2   | 2021-01-05 13:07:29 | 2021-01-05 13:07:29 |
| 1   | 2021-01-05 13:07:39 | 2021-01-05 13:07:43 |
| 2   | 2021-01-05 13:07:46 | 2021-01-05 13:07:52 |
| 1   | 2021-01-05 13:07:54 | 2021-01-05 13:07:59 |
| 2   | 2021-01-05 13:08:01 | 2021-01-05 13:08:01 |
| 1   | 2021-01-05 13:08:04 | 2021-01-05 13:08:04 |

我如何在1个SQL查询中实现这一点?

首先,我们需要消除每个常量间隔中间的记录。我们使用lead和lag函数获得与下一个或前一个记录相比具有不同val的所有记录

select
*
from
(select
lag(val) over (order by created_at) prev_val,
val,
lead(val) over (order by created_at) next_val,
created_at
from tbl
order by created_at) t
where (
prev_val isnull
or next_val isnull
or val != next_val
or val != prev_val
)
| prev_val | val | next_val | created_at          |
|:---------|:----|:---------|:--------------------|
| NULL     | 1   | 2        | 2021-01-05 13:07:23 |
| 1        | 2   | 1        | 2021-01-05 13:07:29 |
| 2        | 1   | 1        | 2021-01-05 13:07:39 |
| 1        | 1   | 2        | 2021-01-05 13:07:43 |
| 1        | 2   | 2        | 2021-01-05 13:07:46 |
| 2        | 2   | 1        | 2021-01-05 13:07:52 |
| 2        | 1   | 1        | 2021-01-05 13:07:54 |
| 1        | 1   | 2        | 2021-01-05 13:07:59 |
| 1        | 2   | 1        | 2021-01-05 13:08:01 |
| 2        | 1   | NULL     | 2021-01-05 13:08:04 |

这里我们有每个常数区间的端点。我们也有单点间隔的记录。

接下来,我们需要结合每个区间的左右两端。

select
val,
(case when prev_val_2 = val then prev_created_at_2 else created_at end) as created_from,
created_at as created_to
from
(select
lag(val) over (order by created_at) prev_val_2,
val,
lead(val) over (order by created_at) next_val_2,
lag(created_at) over (order by created_at) prev_created_at_2,
created_at
from
(select
lag(val) over (order by created_at) prev_val,
val,
lead(val) over (order by created_at) next_val,
created_at
from tbl
order by created_at) t
where (
(prev_val isnull
or next_val isnull
or val != next_val
or val != prev_val)
)) tt
where val != next_val_2 or next_val_2 isnull
| val | created_from        | created_to          |
|:----|:--------------------|:--------------------|
| 1   | 2021-01-05 13:07:23 | 2021-01-05 13:07:23 |
| 2   | 2021-01-05 13:07:29 | 2021-01-05 13:07:29 |
| 1   | 2021-01-05 13:07:39 | 2021-01-05 13:07:43 |
| 2   | 2021-01-05 13:07:46 | 2021-01-05 13:07:52 |
| 1   | 2021-01-05 13:07:54 | 2021-01-05 13:07:59 |
| 2   | 2021-01-05 13:08:01 | 2021-01-05 13:08:01 |
| 1   | 2021-01-05 13:08:04 | 2021-01-05 13:08:04 |

(case when prev_val_2 = val then prev_created_at_2 else created_at end)表达解释:

我们有两种情况:像这样的单点区间

| prev_val | val | next_val | created_at          |
|:---------|:----|:---------|:--------------------|
| NULL     | 1   | 2        | 2021-01-05 13:07:23 |
| 1        | 2   | 1        | 2021-01-05 13:07:29 |  <--

和这样的多点间隔

| prev_val | val | next_val | created_at          |
|:---------|:----|:---------|:--------------------|
| 2        | 1   | 1        | 2021-01-05 13:07:39 |  <--
| 1        | 1   | 2        | 2021-01-05 13:07:43 |  <--

条件lag(val) = val(prev_val_2 = val)代表第二种情况(1=1),因此这里我们将lag(created_at)(prev_created_at_2)作为created_from:2021-01-05 13:07:39

条件lag(val) != val(1!=2)表示第一种情况,因此这里我们将created_at同时作为created_tocreated_from:2021-01-05 13:07:29

最新更新