I have this data (sample):
event_id period_start period_end rating
100269 2/8/2016 6/30/2016 1
100269 6/30/2016 12/31/2016 1
100269 12/31/2016 6/30/2017 2
100269 6/30/2017 12/31/2017 2
I want to "compress" rows when the periods (period_start, period_end) are immediately consecutive and the rating is the same. The desired output is:
event_id period_start period_end rating
100269 2/8/2016 12/31/2016 1
100269 12/31/2016 12/31/2017 2
Note that in this dataset, the periods for some event_id values are not all directly consecutive. Here is an example and the desired output:
event_id period_start period_end rating
100300 2/8/2016 6/30/2016 1
100300 6/30/2016 12/31/2016 1
100300 6/30/2017 12/31/2017 1
Desired output:
event_id period_start period_end rating
100300 2/8/2016 12/31/2016 1
100300 6/30/2017 12/31/2017 1
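You can determine whether two periods are directly consecutive by testing whether the previous row's period_end equals the current row's period_start (this test identifies directly consecutive periods across the entire dataset).
I think there is a solution here involving GROUP BY, but I'm not seeing it. Any help would be great. Thanks!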
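with a as (
    -- brk = 1 when a row does not start where the previous row
    -- (same event_id and rating) ended; the first row of a partition gets 1
    select *,
           case when lag(period_end) over (partition by event_id, rating order by period_start) = period_start
                then 0 else 1 end as brk
    from T
), b as (
    -- the running total of brk numbers each group of consecutive rows
    select *,
           sum(brk) over (partition by event_id, rating order by period_start) as grp
    from a
)
-- collapse each group to a single row
select event_id, min(period_start) as period_start, max(period_end) as period_end, rating
from b
group by event_id, grp, rating
order by event_id, grp, rating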
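Determine which rows in the sequence are breaks and mark them with 1. Number the groups by taking a running total of the break flags. Then use group by to collapse each group to a single row.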
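To make the three steps concrete, here is a minimal Python sketch that mimics what the window functions compute on the sample rows from the question. It is only an illustration: the function name collapse and the in-memory row layout are hypothetical, and the dates are written as date objects rather than the M/D/YYYY strings shown above.

```python
from datetime import date
from itertools import groupby

# Sample rows from the question: (event_id, period_start, period_end, rating)
rows = [
    (100269, date(2016, 2, 8), date(2016, 6, 30), 1),
    (100269, date(2016, 6, 30), date(2016, 12, 31), 1),
    (100269, date(2016, 12, 31), date(2017, 6, 30), 2),
    (100269, date(2017, 6, 30), date(2017, 12, 31), 2),
    (100300, date(2016, 2, 8), date(2016, 6, 30), 1),
    (100300, date(2016, 6, 30), date(2016, 12, 31), 1),
    (100300, date(2017, 6, 30), date(2017, 12, 31), 1),
]

def collapse(rows):
    """Merge rows of the same (event_id, rating) whose periods touch end to start."""
    result = []
    # Equivalent of "partition by event_id, rating order by period_start".
    ordered = sorted(rows, key=lambda r: (r[0], r[3], r[1]))
    for (event_id, rating), part in groupby(ordered, key=lambda r: (r[0], r[3])):
        prev_end = None  # plays the role of lag(period_end); None on the first row
        grp = 0          # running total of the break flags
        islands = {}     # grp -> [min(period_start), max(period_end)]
        for _, start, end, _ in part:
            brk = 0 if prev_end == start else 1  # step 1: flag the breaks
            grp += brk                           # step 2: number the groups
            if grp not in islands:               # step 3: collapse per group
                islands[grp] = [start, end]
            else:
                islands[grp][0] = min(islands[grp][0], start)
                islands[grp][1] = max(islands[grp][1], end)
            prev_end = end
        for start, end in islands.values():
            result.append((event_id, start, end, rating))
    return sorted(result)

collapsed = collapse(rows)
for row in collapsed:
    print(row)
```

For event_id 100300 the third row starts on 6/30/2017 while the previous row ended on 12/31/2016, so it gets a break flag of 1 and lands in its own group, which is why it stays a separate row.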
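This is a gaps-and-islands problem. The key idea is to use lag() to find where the values change, and then do a cumulative sum to assign groups.
However, I prefer to apply lag() to the date column rather than to the value column. This turns out to be much more convenient when you have several values that can change.
In your case, this looks like: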
select event_id, min(period_start) as period_start, max(period_end) as period_end, rating
from (select t.*,
             -- a row continues the current group only if it starts where the
             -- previous row with the same rating ended
             sum(case when prev_period_end = period_start then 0 else 1 end) over (partition by event_id order by period_start) as grp
      from (select t.*,
                   lag(period_end) over (partition by event_id, rating order by period_start) as prev_period_end
            from t
           ) t
     ) t
group by event_id, rating, grp;
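As a quick sanity check, the nested query can be run against the sample rows from the question, comparing prev_period_end with period_start as the question's rule requires. The sketch below uses Python's built-in sqlite3 module and makes a few assumptions that are not in the original post: the SQLite build supports window functions (3.25 or later), the dates are stored as ISO yyyy-mm-dd text so that they sort correctly, and an order by is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table t (event_id int, period_start text, period_end text, rating int)"
)
# Sample rows from the question, with dates rewritten as ISO text (an assumption).
conn.executemany(
    "insert into t values (?, ?, ?, ?)",
    [
        (100269, "2016-02-08", "2016-06-30", 1),
        (100269, "2016-06-30", "2016-12-31", 1),
        (100269, "2016-12-31", "2017-06-30", 2),
        (100269, "2017-06-30", "2017-12-31", 2),
        (100300, "2016-02-08", "2016-06-30", 1),
        (100300, "2016-06-30", "2016-12-31", 1),
        (100300, "2017-06-30", "2017-12-31", 1),
    ],
)

query = """
select event_id, min(period_start) as period_start, max(period_end) as period_end, rating
from (select t.*,
             sum(case when prev_period_end = period_start then 0 else 1 end)
                 over (partition by event_id order by period_start) as grp
      from (select t.*,
                   lag(period_end) over (partition by event_id, rating order by period_start) as prev_period_end
            from t
           ) t
     ) t
group by event_id, rating, grp
order by event_id, min(period_start)
"""

result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

On this data the query returns four rows: two for 100269 (one per rating) and two for 100300 (one on each side of the gap), which matches the desired output in the question.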