在Spark中查找最长的连续条纹



我有一个类似的单列数据帧

------------
date     
------------
01/01/2020       
02/01/2020  
04/01/2020    
05/01/2020    
06/01/2020 

我必须得到最长的连续周期——开始日期和结束日期。所以在上面的例子中,我有一个类似的输出

-----------------------------------------------
start       |   end           |  period_length |
-----------------------------------------------
04/01/2020    06/01/2020             3         

我的方法:对数据进行排序并找到与前一行的滞后,并且每当存在滞后>1、重置周期长度但我想不出在特定条件下重置周期的方法。我正在使用spark 2.3

注意:我的列名是"eventTime";像";2020-12-14 13:49:32";

sc.sql(
"""
|
|   select
|     min(eventTime), max(eventTime) ,  count(1)  as counts
|   from
|   (
|       select
|           eventTime , date_sub(eventTime , rn) as dis
|       from
|       (
|           select
|               eventTime , row_number() over(partition by 1 order by eventTime) rn
|           from (select distinct substring(eventTime,0,10) as eventTime from ST_INOUT_RECORD)
|       ) t1
|   ) t2
|   group by dis  having counts > 2
|
|""".stripMargin).show()

结果

|min(eventTime)|max(eventTime)|counts|
+--------------+--------------+------+
|    2020-09-12|    2020-12-14|    94|
|    2020-01-01|    2020-09-10|   254|
+--------------+--------------+------+

最新更新