Redshift SQL - Count Sequences of Repeating Values Within Groups



I have a table that looks like this:

| id |      date_start     |    gap_7_days   |
| -- | ------------------- | --------------- |
|  1 | 2021-06-10 00:00:00 |        0        |
|  1 | 2021-06-13 00:00:00 |        0        |
|  1 | 2021-06-19 00:00:00 |        0        |
|  1 | 2021-06-27 00:00:00 |        0        |
|  2 | 2021-07-04 00:00:00 |        1        |
|  2 | 2021-07-11 00:00:00 |        1        |
|  2 | 2021-07-18 00:00:00 |        1        |
|  2 | 2021-07-25 00:00:00 |        1        |
|  2 | 2021-08-01 00:00:00 |        1        |
|  2 | 2021-08-08 00:00:00 |        1        |
|  2 | 2021-08-09 00:00:00 |        0        |
|  2 | 2021-08-16 00:00:00 |        1        |
|  2 | 2021-08-23 00:00:00 |        1        |
|  2 | 2021-08-30 00:00:00 |        1        |
|  2 | 2021-08-31 00:00:00 |        0        |
|  2 | 2021-09-01 00:00:00 |        0        |
|  2 | 2021-08-08 00:00:00 |        1        |
|  2 | 2021-08-15 00:00:00 |        1        |
|  2 | 2021-08-22 00:00:00 |        1        |
|  2 | 2021-08-23 00:00:00 |        1        |

For each ID, I check whether consecutive date_start values are 7 days apart, and put a 1 or a 0 in gap_7_days accordingly.

I want to do the following (using Redshift SQL only):

  1. Get the length of each sequence of consecutive 1s in gap_7_days for each ID. Expected output:

| id |      date_start     |    gap_7_days   | sequence_length |
| -- | ------------------- | --------------- | --------------- |
|  1 | 2021-06-10 00:00:00 |        0        |                 |
|  1 | 2021-06-13 00:00:00 |        0        |                 |
|  1 | 2021-06-19 00:00:00 |        0        |                 |
|  1 | 2021-06-27 00:00:00 |        0        |                 |
|  2 | 2021-07-04 00:00:00 |        1        |        6        |
|  2 | 2021-07-11 00:00:00 |        1        |        6        |
|  2 | 2021-07-18 00:00:00 |        1        |        6        |
|  2 | 2021-07-25 00:00:00 |        1        |        6        |
|  2 | 2021-08-01 00:00:00 |        1        |        6        |
|  2 | 2021-08-08 00:00:00 |        1        |        6        |
|  2 | 2021-08-09 00:00:00 |        0        |                 |
|  2 | 2021-08-16 00:00:00 |        1        |        3        |
|  2 | 2021-08-23 00:00:00 |        1        |        3        |
|  2 | 2021-08-30 00:00:00 |        1        |        3        |
|  2 | 2021-08-31 00:00:00 |        0        |                 |
|  2 | 2021-09-01 00:00:00 |        0        |                 |
|  2 | 2021-08-08 00:00:00 |        1        |        4        |
|  2 | 2021-08-15 00:00:00 |        1        |        4        |
|  2 | 2021-08-22 00:00:00 |        1        |        4        |
|  2 | 2021-08-23 00:00:00 |        1        |        4        |

  2. Get the number of sequences for each ID. Expected output:

| id |    num_sequences    |
| -- | ------------------- |
|  1 |          0          |
|  2 |          3          |

How can I do this?

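If you want the number of sequences, just look at the previous value. When the current value is "1" and the previous one is NULL or 0, you have a new sequence.

So: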

    select id,
           -- a sequence starts where the current value is 1 and the previous one is 0 or NULL
           sum( (gap_7_days = 1 and coalesce(prev_gap_7_days, 0) = 0)::int ) as num_sequences
    from (select t.*,
                 lag(gap_7_days) over (partition by id order by date_start) as prev_gap_7_days
          from t
         ) t
    group by id;
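
To try the idea on a small case, here is a minimal sketch. The rows are made up for illustration (they are not the question's data), and the table name `t` is taken from the query above. It uses a `case` expression instead of the `::int` cast, which should behave the same way. It also assumes that `date_start` orders the rows of each id without ties. In the question's sample, the last four id = 2 rows repeat earlier dates, so they would interleave with the earlier rows under this ordering.

```sql
-- Hypothetical sample rows, for illustration only.
create temp table t (id int, date_start timestamp, gap_7_days int);

insert into t values
    (1, '2021-06-10', 0),
    (1, '2021-06-13', 0),
    (2, '2021-07-04', 1),
    (2, '2021-07-11', 1),
    (2, '2021-07-12', 0),
    (2, '2021-07-19', 1);

-- Count the rows where a run of 1s begins (current = 1, previous = 0 or NULL).
select id,
       sum(case when gap_7_days = 1 and coalesce(prev_gap_7_days, 0) = 0
                then 1 else 0 end) as num_sequences
from (select t.*,
             lag(gap_7_days) over (partition by id order by date_start) as prev_gap_7_days
      from t
     ) t
group by id
order by id;
-- id 1 -> 0, id 2 -> 2
```

For id 2, the first row starts a run (no previous row), the second continues it, the 0 ends it, and the final 1 starts a second run, so the count is 2.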

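If you actually want the lengths of the sequences, as in the intermediate result, then ask a new question. That information is not needed for this question.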
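
For reference, one common way to get the lengths is the "gaps and islands" pattern, which builds on the same start-of-sequence flag. The following is only a sketch and not part of the answer above. The names `flagged`, `numbered`, `is_start`, and `seq_no` are made up for illustration, the table is assumed to be `t`, and `date_start` is assumed to order the rows of each id without ties.

```sql
-- Sketch only: assumes table t(id, date_start, gap_7_days).
with flagged as (
    select t.*,
           -- 1 on the first row of each run of 1s, 0 elsewhere
           case when gap_7_days = 1
                 and coalesce(lag(gap_7_days) over (partition by id order by date_start), 0) = 0
                then 1 else 0 end as is_start
    from t
),
numbered as (
    select flagged.*,
           -- running count of run starts = run number within the id
           sum(is_start) over (partition by id order by date_start
                               rows between unbounded preceding and current row) as seq_no
    from flagged
)
select id, date_start, gap_7_days,
       -- count only the 1-rows that share a run number; 0-rows stay NULL
       case when gap_7_days = 1
            then count(*) over (partition by id, seq_no, gap_7_days)
       end as sequence_length
from numbered
order by id, date_start;
```

The explicit `rows between unbounded preceding and current row` frame is there because Redshift expects a frame clause when an aggregate window function has an `order by`. Partitioning the final count by `gap_7_days` as well as `seq_no` keeps the 0 rows that trail a run from being counted in its length.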
