I have a dataset that looks like this:
respondent_id day_session daydiff
nmo876 11/19/2017 0
nmo876 11/20/2017 1
nmo876 11/21/2017 1
nmo876 11/23/2017 2
nmo876 11/24/2017 1
nmo876 11/25/2017 1
nmo876 11/26/2017 1
nmo876 11/27/2017 1
nmo876 11/28/2017 1
nmo876 11/29/2017 1
nmo876 11/30/2017 1
nmo876 12/1/2017 1
nmo876 12/2/2017 1
nmo876 12/3/2017 1
nmo876 12/4/2017 1
nmo876 12/5/2017 1
nmo876 12/6/2017 1
nmo876 12/7/2017 1
nmo876 12/8/2017 1
nmo876 12/9/2017 1
nmo876 12/10/2017 1
nmo876 12/11/2017 1
nmo876 12/12/2017 1
nmo876 12/13/2017 1
nmo876 12/14/2017 1
nmo876 12/15/2017 1
nmo876 12/16/2017 1
nmo876 12/17/2017 1
nmo876 12/18/2017 1
nmo876 12/19/2017 1
nmo876 12/20/2017 1
nmo876 12/23/2017 3
nmo876 12/24/2017 1
nmo876 12/26/2017 2
nmo876 12/27/2017 1
nmo876 12/28/2017 1
nmo876 12/29/2017 1
nmo876 12/30/2017 1
nmo876 12/31/2017 1
nmo876 1/2/2018 2
nmo876 1/3/2018 1
nmo876 1/4/2018 1
nmo876 1/5/2018 1
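I want to write a script that picks the largest block of consecutive day_sessions (i.e. runs where daydiff = 1) from this dataset, where a user may have several such blocks. For nmo876 the output would be 27.
Below is more data for which the code should compute the largest block of consecutive daily sessions. For user jkl567 the output would be 37: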
jkl567 11/19/2017 1
jkl567 11/20/2017 1
jkl567 11/21/2017 1
jkl567 11/22/2017 1
jkl567 11/23/2017 1
jkl567 11/24/2017 1
jkl567 11/25/2017 1
jkl567 11/26/2017 1
jkl567 11/27/2017 1
jkl567 11/28/2017 1
jkl567 11/29/2017 1
jkl567 11/30/2017 1
jkl567 12/1/2017 1
jkl567 12/2/2017 1
jkl567 12/3/2017 1
jkl567 12/4/2017 1
jkl567 12/5/2017 1
jkl567 12/6/2017 1
jkl567 12/7/2017 1
jkl567 12/8/2017 1
jkl567 12/9/2017 1
jkl567 12/10/2017 1
jkl567 12/11/2017 1
jkl567 12/12/2017 1
jkl567 12/13/2017 1
jkl567 12/14/2017 1
jkl567 12/15/2017 1
jkl567 12/16/2017 1
jkl567 12/17/2017 1
jkl567 12/18/2017 1
jkl567 12/19/2017 1
jkl567 12/20/2017 1
jkl567 12/21/2017 1
jkl567 12/22/2017 1
jkl567 12/23/2017 1
jkl567 12/24/2017 1
jkl567 12/25/2017 1
jkl567 12/26/2017 2
jkl567 12/28/2017 1
jkl567 12/29/2017 3
jkl567 1/1/2018 1
jkl567 1/2/2018 1
jkl567 1/3/2018 1
jkl567 1/4/2018 1
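You can subtract row_number() from day_session to get a value that stays constant within each run of consecutive days, and that value defines the group. To get the length of each group: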
select respondent_id,
       (day_session - seqnum * interval '1 day') as grp,  -- constant within a run of consecutive days
       count(*) as days_in_row
from (select t.*,
             row_number() over (partition by respondent_id order by day_session) as seqnum
      from t
     ) t
group by respondent_id, (day_session - seqnum * interval '1 day');
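To see why the subtraction works, here is a minimal Python sketch of the same expression, outside the database. The five dates are a hypothetical sample with one gap, and `group_keys` is a name I made up for illustration. It is not part of the query.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical sample: five session dates for one respondent, with a gap after 11/21.
sessions = [date(2017, 11, 19), date(2017, 11, 20), date(2017, 11, 21),
            date(2017, 11, 23), date(2017, 11, 24)]

def group_keys(days):
    """Return day - row_number for each day, mirroring the SQL expression."""
    ordered = sorted(days)
    return [d - timedelta(days=n) for n, d in enumerate(ordered, start=1)]

keys = group_keys(sessions)   # three keys of 2017-11-18, then two of 2017-11-19
days_in_row = Counter(keys)   # size of each group, like count(*) ... group by
print(days_in_row)
```

Within a run of consecutive days both the date and the row number go up by one per row, so their difference does not change. A gap pushes the date ahead of the row number, which starts a new key.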
Then you can use distinct on to get the maximum for each respondent. I use a subquery for this:
select distinct on (respondent_id) t.*
from (select respondent_id,
             (day_session - seqnum * interval '1 day') as grp,
             count(*) as days_in_row
      from (select t.*,
                   row_number() over (partition by respondent_id order by day_session) as seqnum
            from t
           ) t
      group by respondent_id, (day_session - seqnum * interval '1 day')
     ) t
-- distinct on keeps the first row per respondent_id, i.e. the longest run
order by respondent_id, days_in_row desc;
Strictly speaking, the subquery is not necessary. I just find the logic easier to follow when it is broken up this way.
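If it helps to check the result outside the database, the whole computation can be sketched in Python. This is only an illustration under assumptions: `longest_streaks` and the sample rows are hypothetical, and dropping duplicate (respondent, day) pairs is an extra step the queries above do not perform.

```python
from collections import Counter
from datetime import date, timedelta
from itertools import groupby

def longest_streaks(rows):
    """Map each respondent_id to the length of its longest run of consecutive days.

    rows: iterable of (respondent_id, day) pairs.
    Assumption: duplicate pairs are dropped first (the SQL does not do this).
    """
    result = {}
    ordered = sorted(set(rows))  # sorts by respondent_id, then by day
    for rid, grp in groupby(ordered, key=lambda r: r[0]):
        # day - row_number is constant within a run of consecutive days
        keys = [day - timedelta(days=n) for n, (_, day) in enumerate(grp, start=1)]
        result[rid] = max(Counter(keys).values())
    return result

# Hypothetical sample rows, not the full dataset from the question.
rows = [
    ("a", date(2017, 12, 20)), ("a", date(2017, 12, 23)), ("a", date(2017, 12, 24)),
    ("b", date(2017, 12, 31)), ("b", date(2018, 1, 1)), ("b", date(2018, 1, 2)),
    ("b", date(2018, 1, 4)),
]
streaks = longest_streaks(rows)
print(streaks)
```

One thing to be aware of: this counts the days in the streak, just as count(*) does. In the question's data the longest run for nmo876 goes from 11/23 through 12/20, which is 28 days, while the number of daydiff = 1 rows in it is 27. If 27 is the number you want, subtract 1 from days_in_row.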