We have a time series in a Spark SQL table that records the time of each event a user executes.
However, users tend to execute many events in bursts, and I want to find each user's minimum time within each burst.
Unfortunately this is historical data, so I cannot change how the table is created. Essentially I want select min(time_), user from my_table group by user
, but per burst. Any help would be appreciated!
Edit:
Some sample data:
user time_
0 10
0 11
2 12
0 12
2 13
2 15
0 83
0 84
0 85
So, for example, in the data above I would want to find (0, 10), (2, 12) and (0, 83). Let's say a burst occurs when consecutive events fall within one hour of each other (60 time units in the sample data above).
If that is the only information you need:
select user
,time_
from (select user
,time_
,case when time_ - lag (time_,1,time_-60) over (partition by user order by time_) >= 60 then 'Y' else null end as burst
from my_table
) t
where burst = 'Y'
;
user time_
0 10
0 83
2 12
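The lag-based filter above can be sketched in plain Python to check the logic against the sample data. This is a minimal illustration, not Spark code: it assumes rows are `(user, time_)` tuples and sorts them the way the `partition by user order by time_` clause would.

```python
from itertools import groupby

GAP = 60  # burst threshold, as in the SQL above

def burst_starts(rows, gap=GAP):
    """Return (user, time_) pairs where a new burst begins.

    Mirrors the SQL: a row starts a burst when the gap to the
    previous event of the same user is >= gap; the first event
    of each user always starts one (like lag's time_-60 default).
    """
    starts = []
    for user, grp in groupby(sorted(rows), key=lambda r: r[0]):
        prev = None
        for _, t in grp:
            if prev is None or t - prev >= gap:
                starts.append((user, t))
            prev = t
    return starts

rows = [(0, 10), (0, 11), (2, 12), (0, 12), (2, 13),
        (2, 15), (0, 83), (0, 84), (0, 85)]
print(burst_starts(rows))  # [(0, 10), (0, 83), (2, 12)]
```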
If you need to collect some additional information about each burst:
select user
,burst_seq
,min (time_) as min_time_
,max (time_) as max_time_
,count (*) as events_num
from (select user
,time_
,count(burst) over
(
partition by user
order by time_
rows unbounded preceding
) + 1 as burst_seq
from (select user
,time_
,case when time_ - lag (time_) over (partition by user order by time_) >= 60 then 'Y' else null end as burst
from my_table
) t
) t
group by user
,burst_seq
;
user burst_seq min_time_ max_time_ events_num
0 1 10 12 3
0 2 83 85 3
2 1 12 15 3
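The running `count(burst) over (...) + 1` trick above is a gaps-and-islands pattern: each 'Y' marker bumps a per-user burst counter, which then serves as a grouping key. A small Python sketch of that aggregation, under the same `(user, time_)`-tuple assumption as before:

```python
from itertools import groupby

GAP = 60  # burst threshold, as in the SQL above

def burst_summary(rows, gap=GAP):
    """Per (user, burst_seq): (min_time_, max_time_, events_num).

    burst_seq mirrors the SQL running count(burst) + 1: it is
    incremented whenever the gap to the previous event is >= gap
    (and for each user's first event, where lag is null).
    """
    out = {}
    for user, grp in groupby(sorted(rows), key=lambda r: r[0]):
        seq, prev = 0, None
        for _, t in grp:
            if prev is None or t - prev >= gap:
                seq += 1  # new island starts here
            mn, mx, n = out.get((user, seq), (t, t, 0))
            out[(user, seq)] = (min(mn, t), max(mx, t), n + 1)
            prev = t
    return out

rows = [(0, 10), (0, 11), (2, 12), (0, 12), (2, 13),
        (2, 15), (0, 83), (0, 84), (0, 85)]
print(burst_summary(rows))
# {(0, 1): (10, 12, 3), (0, 2): (83, 85, 3), (2, 1): (12, 15, 3)}
```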
P.S. There seems to be a bug around the CASE expression.
case when ... then 'Y' end
fails with Failed: IndexOutOfBoundsException Index: 2, Size: 2, even though it is legal syntax.
Adding an explicit else null
solved it.