Find the first sighting in SQL

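We have a time series in a Spark SQL table that records the time of every event a user performs.

However, users tend to perform many events in a burst. I want to find the minimum time of each of these bursts, per user.

Unfortunately this is historical data, so I can't change how the table was created. So I basically want a select min(time_), user from my_table group by user, but per burst. Any help would be greatly appreciated!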

Edit:

Some sample data:

user    time_
0       10
0       11
2       12
0       12
2       13
2       15
0       83
0       84
0       85

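So, for example, in the data above I want to find (0,10), (2,12) and (0,83). We can say that events belong to the same burst if they occur within 1 hour of each other (60 in the sample data above).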

If this is the only information you need:

select      user
           ,time_
from       (select      user
                       ,time_
                       ,case when time_ - lag (time_,1,time_-60) over (partition by user order by time_) >= 60 then 'Y' else null end  as burst
            from        my_table 
            ) t
where       burst = 'Y'
;

user    time_
0       10
0       83
2       12

If you need to gather some additional information about each burst:

select      user
           ,burst_seq
           ,min (time_) as min_time_
           ,max (time_) as max_time_
           ,count (*)   as events_num
from       (select      user
                       ,time_
                       ,count(burst) over 
                        (
                            partition by    user 
                            order by        time_  
                            rows unbounded preceding
                        ) + 1                           as burst_seq
            from       (select      user
                                   ,time_
                                   ,case when time_ - lag (time_) over (partition by user order by time_) >= 60 then 'Y' else null end as burst
                        from        my_table 
                        ) t
            ) t
group by    user
           ,burst_seq
;

user    burst_seq   min_time_   max_time_   events_num
0       1           10          12          3
0       2           83          85          3
2       1           12          15          3
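
The running count in the query above can be read as "how many burst boundaries have been seen so far for this user, plus one". A rough Python sketch of that numbering follows, again purely as an illustration: burst_summary and GAP are hypothetical names, and the threshold of 60 is assumed from the question.

```python
from collections import defaultdict

# Sample rows from the question: (user, time_)
rows = [(0, 10), (0, 11), (2, 12), (0, 12), (2, 13),
        (2, 15), (0, 83), (0, 84), (0, 85)]

GAP = 60  # assumed burst threshold, taken from the question's example


def burst_summary(rows, gap=GAP):
    """Return (user, burst_seq, min_time_, max_time_, events_num) per burst."""
    by_user = defaultdict(list)
    for user, t in rows:
        by_user[user].append(t)

    summary = []
    for user in sorted(by_user):
        times = sorted(by_user[user])
        # State of the burst currently being accumulated
        burst_seq, first, last, count = 1, times[0], times[0], 1
        for t in times[1:]:
            if t - last >= gap:
                # A boundary: emit the finished burst and start the next one
                summary.append((user, burst_seq, first, last, count))
                burst_seq, first, count = burst_seq + 1, t, 0
            last = t
            count += 1
        summary.append((user, burst_seq, first, last, count))
    return summary


for line in burst_summary(rows):
    print(line)
```

On the sample data this yields the same three groups as the query result above.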

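P.S. There seems to be a bug with the CASE statement.
case when ... then 'Y' end fails with: FAILED: IndexOutOfBoundsException Index: 2, Size: 2, even though it is legal syntax.
Adding else null solved it.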
