SQL (Presto) - "Condense" rows when date ranges are consecutive



I have this data (sample):

event_id    period_start    period_end  rating
100269      2/8/2016        6/30/2016   1
100269      6/30/2016       12/31/2016  1
100269      12/31/2016      6/30/2017   2
100269      6/30/2017       12/31/2017  2

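When the periods (period_start to period_end) are immediately consecutive and the rating is the same, I want to "condense" the rows. The desired output is: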

event_id    period_start    period_end  rating
100269      2/8/2016        12/31/2016  1
100269      12/31/2016      12/31/2017  2

Note that in this data set, not all periods are directly consecutive for some event_ids. Here is an example and the desired output:

event_id    period_start    period_end  rating
100300      2/8/2016        6/30/2016   1
100300      6/30/2016       12/31/2016  1
100300      6/30/2017       12/31/2017  1

Desired output:

event_id    period_start    period_end  rating
100300      2/8/2016        12/31/2016  1
100300      6/30/2017       12/31/2017  1

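You can determine whether two periods are directly consecutive by testing whether the previous row's period_end equals the current row's period_start (this holds across the whole data set as the way to identify directly consecutive periods).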

I think there is a solution here involving GROUP BY, but I'm not seeing it. Any help would be great. Thanks!

with a as (
    -- brk = 1 when a row does not continue the previous period
    -- for the same event_id and rating (including the first row)
    select *,
           case when lag(period_end) over (partition by event_id, rating order by period_start) = period_start
                then 0 else 1 end as brk
    from T
), b as (
    -- running total of breaks numbers the groups
    select *,
           sum(brk) over (partition by event_id, rating order by period_start) as grp
    from a
)
select event_id, min(period_start) as period_start, max(period_end) as period_end, rating
from b
group by event_id, grp, rating
order by event_id, grp, rating

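Determine which rows in the sequence are breaks and mark them with 1. Number the groups by counting the breaks (a running total). Use group by to collapse each group into a single row.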
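To make the three steps concrete, here is a sketch that stops before the final group by and shows the brk and grp values for each row. It assumes the sample rows from the question, loaded inline with VALUES and with the dates rewritten as ISO date literals; the inline table alias v is only for illustration.

```sql
-- Sample rows from the question, inlined so the query runs on its own.
with t as (
    select *
    from (values
        (100269, date '2016-02-08', date '2016-06-30', 1),
        (100269, date '2016-06-30', date '2016-12-31', 1),
        (100269, date '2016-12-31', date '2017-06-30', 2),
        (100269, date '2017-06-30', date '2017-12-31', 2),
        (100300, date '2016-02-08', date '2016-06-30', 1),
        (100300, date '2016-06-30', date '2016-12-31', 1),
        (100300, date '2017-06-30', date '2017-12-31', 1)
    ) as v (event_id, period_start, period_end, rating)
), a as (
    -- Step 1: mark a break (1) when the row does not continue the previous one.
    select *,
           case when lag(period_end) over (partition by event_id, rating order by period_start) = period_start
                then 0 else 1 end as brk
    from t
)
-- Step 2: the running total of breaks numbers the groups.
select event_id, period_start, period_end, rating, brk,
       sum(brk) over (partition by event_id, rating order by period_start) as grp
from a
order by event_id, period_start;

-- event_id  period_start  period_end  rating  brk  grp
-- 100269    2016-02-08    2016-06-30  1       1    1
-- 100269    2016-06-30    2016-12-31  1       0    1
-- 100269    2016-12-31    2017-06-30  2       1    1
-- 100269    2017-06-30    2017-12-31  2       0    1
-- 100300    2016-02-08    2016-06-30  1       1    1
-- 100300    2016-06-30    2016-12-31  1       0    1
-- 100300    2017-06-30    2017-12-31  1       1    2
```

The first row of each partition has no previous row, so lag() returns NULL, the comparison is not true, and the row is marked as a break. Note that grp restarts at 1 within each (event_id, rating) partition, which is why the final query must group by rating as well as grp.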

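This is a gaps-and-islands problem. The key idea is to use lag() to find where the value changes, and then do a cumulative sum to assign groups.

However, I prefer to take the lag of the date column rather than the lag of the value column. This turns out to be much more convenient when you have multiple values that could change.

In your case, this looks like: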

select event_id, min(period_start), max(period_end), rating
from (select t.*,
             -- a row starts a new group unless it begins where the previous
             -- period with the same rating ended
             sum(case when prev_period_end = period_start then 0 else 1 end) over (partition by event_id order by period_start) as grp
      from (select t.*,
                   lag(period_end) over (partition by event_id, rating order by period_start) as prev_period_end
            from t
           ) t
     ) t
group by event_id, rating, grp;
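As a check, the lag-on-date approach can be run end to end against the sample rows from the question. This is a sketch under the same assumptions as the question's data: the rows are inlined with VALUES, the dates are written as ISO date literals, and a row continues the previous one when prev_period_end equals its period_start. The output aliases and the order by are added only to make the result easy to read.

```sql
with t as (
    select *
    from (values
        (100269, date '2016-02-08', date '2016-06-30', 1),
        (100269, date '2016-06-30', date '2016-12-31', 1),
        (100269, date '2016-12-31', date '2017-06-30', 2),
        (100269, date '2017-06-30', date '2017-12-31', 2),
        (100300, date '2016-02-08', date '2016-06-30', 1),
        (100300, date '2016-06-30', date '2016-12-31', 1),
        (100300, date '2017-06-30', date '2017-12-31', 1)
    ) as v (event_id, period_start, period_end, rating)
)
select event_id,
       min(period_start) as period_start,
       max(period_end)   as period_end,
       rating
from (select t2.*,
             -- cumulative count of rows that do not continue the previous period
             sum(case when prev_period_end = period_start then 0 else 1 end)
                 over (partition by event_id order by period_start) as grp
      from (select t.*,
                   -- end date of the previous period with the same event_id and rating
                   lag(period_end) over (partition by event_id, rating order by period_start) as prev_period_end
            from t
           ) t2
     ) t3
group by event_id, rating, grp
order by event_id, min(period_start);

-- event_id  period_start  period_end  rating
-- 100269    2016-02-08    2016-12-31  1
-- 100269    2016-12-31    2017-12-31  2
-- 100300    2016-02-08    2016-12-31  1
-- 100300    2017-06-30    2017-12-31  1
```

Because lag() is partitioned by rating, a change of rating yields a NULL or non-matching prev_period_end, so the row is counted as a new group even though the running sum is partitioned by event_id alone.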
