我有一个类似的单列数据帧
------------
date
------------
01/01/2020
02/01/2020
04/01/2020
05/01/2020
06/01/2020
我必须得到最长的连续周期——开始日期和结束日期。所以在上面的例子中,我有一个类似的输出
-----------------------------------------------
start | end | period_length |
-----------------------------------------------
04/01/2020 06/01/2020 3
我的方法:对数据进行排序并找到与前一行的滞后,并且每当存在滞后>1、重置周期长度但我想不出在特定条件下重置周期的方法。我正在使用spark 2.3
注意:我的列名是"eventTime";像";2020-12-14 13:49:32";
sc.sql(
"""
|
| select
| min(eventTime), max(eventTime) , count(1) as counts
| from
| (
| select
| eventTime , date_sub(eventTime , rn) as dis
| from
| (
| select
| eventTime , row_number() over(partition by 1 order by eventTime) rn
| from (select distinct substring(eventTime,0,10) as eventTime from ST_INOUT_RECORD)
| ) t1
| ) t2
| group by dis having counts > 2
|
|""".stripMargin).show()
结果
|min(eventTime)|max(eventTime)|counts|
+--------------+--------------+------+
| 2020-09-12| 2020-12-14| 94|
| 2020-01-01| 2020-09-10| 254|
+--------------+--------------+------+