How to form groups of consecutive dates allowing for a given maximum gap?



Given a table:

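 person_id | contact_day | days_last_contact | dash_group
-----------+-------------+-------------------+------------
         1 | 2015-02-09  |                   |          1
         1 | 2015-05-01  |                81 |          2
         1 | 2015-05-02  |                 1 |          2
         1 | 2015-05-03  |                 1 |          2
         1 | 2015-06-01  |                29 |          3
         1 | 2015-08-01  |                61 |          4
         1 | 2015-08-04  |                 3 |          4
         1 | 2015-09-01  |                28 |          5
         2 | 2015-05-01  |                   |          1
         2 | 2015-06-01  |                31 |          2
         2 | 2015-07-01  |                30 |          3
         3 | 2015-05-01  |                   |          1
         3 | 2015-05-02  |                 1 |          1
         3 | 2015-05-04  |                 2 |          1
         3 | 2015-06-01  |                28 |          2
         3 | 2015-06-02  |                 1 |          2
         3 | 2015-06-06  |                 4 |          3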
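
To try the queries in this thread, the sample rows can be loaded into a table first. This is only a minimal sketch: the table name mydata is the one the answers use, but the column types are an assumption (the question does not state them), and dash_group is left out because it is the value to be computed.

```sql
-- Assumed schema: the question does not give column types.
CREATE TABLE mydata (
    person_id         integer NOT NULL,
    contact_day       date    NOT NULL,
    days_last_contact integer            -- NULL for each person's first contact
);

-- The 17 sample rows from the question.
INSERT INTO mydata (person_id, contact_day, days_last_contact) VALUES
    (1, '2015-02-09', NULL),
    (1, '2015-05-01', 81),
    (1, '2015-05-02', 1),
    (1, '2015-05-03', 1),
    (1, '2015-06-01', 29),
    (1, '2015-08-01', 61),
    (1, '2015-08-04', 3),
    (1, '2015-09-01', 28),
    (2, '2015-05-01', NULL),
    (2, '2015-06-01', 31),
    (2, '2015-07-01', 30),
    (3, '2015-05-01', NULL),
    (3, '2015-05-02', 1),
    (3, '2015-05-04', 2),
    (3, '2015-06-01', 28),
    (3, '2015-06-02', 1),
    (3, '2015-06-06', 4);
```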

Counting the gaps (greater than the given tolerance) in a second window function forms the group numbers you are after:

SELECT person_id, contact_day
     , count(*) FILTER (WHERE gap > 3) OVER (PARTITION BY person_id ORDER BY contact_day) AS dash_group
FROM  (
   SELECT person_id, contact_day
        , contact_day - lag(contact_day) OVER (PARTITION BY person_id ORDER BY contact_day) AS gap
   FROM   mydata
   ) sub
ORDER  BY person_id, contact_day;  -- optional

db<>fiddle here
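
One detail worth checking against the expected output: each person's first row has a NULL gap, which the FILTER condition skips, so the numbers produced by the query above start at 0 rather than 1. A possible adjustment, sketched here as a variation and not as part of the original answer, treats the NULL gap as the start of a group as well:

```sql
SELECT person_id, contact_day
       -- A NULL gap marks a person's first row; counting it makes groups start at 1.
     , count(*) FILTER (WHERE gap IS NULL OR gap > 3)
                OVER (PARTITION BY person_id ORDER BY contact_day) AS dash_group
FROM  (
   SELECT person_id, contact_day
        , contact_day - lag(contact_day) OVER (PARTITION BY person_id ORDER BY contact_day) AS gap
   FROM   mydata
   ) sub
ORDER  BY person_id, contact_day;
```

With the sample data from the question, person 1's rows are then numbered 1, 2, 2, 2, 3, 4, 4, 5, which matches the dash_group column shown there.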

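About the aggregate FILTER clause:

  • Aggregate columns with additional (distinct) filters

It's short and intuitive, and typically fastest. See:

  • For absolute performance, is SUM faster or COUNT?

This is the classic topic of "gaps and islands". Once you know to search for the tag gaps-and-islands, you'll find plenty of related or nearly identical questions and answers, like:

  • Select longest continuous sequence
  • How to group timestamps into islands (based on arbitrary gap)?
  • How to label groups in postgresql when group belonging depends on the preceding line?

Etc.

I have now tagged the question accordingly.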

Using a recursive query:
WITH RECURSIVE zzz AS (
    SELECT person_id
         , contact_day
         , md.days_last_contact
         , row_number() OVER (PARTITION BY person_id ORDER BY contact_day) AS dash_group
    FROM mydata md
    WHERE NOT EXISTS ( -- only the group *leaders*
        SELECT * FROM mydata nx
        WHERE nx.person_id = md.person_id
          AND nx.contact_day < md.contact_day
          AND nx.contact_day >= md.contact_day - 3
        )
    UNION ALL
    SELECT md.person_id
         , md.contact_day
         , md.days_last_contact
         , zzz.dash_group
    FROM zzz
    JOIN mydata md ON md.person_id = zzz.person_id
                  AND md.contact_day > zzz.contact_day
                  AND md.contact_day <= zzz.contact_day + 3
                  AND NOT EXISTS ( SELECT * -- eliminate the middle men ...
                      FROM mydata nx
                      WHERE nx.person_id = md.person_id
                        AND nx.contact_day > zzz.contact_day
                        AND nx.contact_day < md.contact_day
                      )
    )
SELECT * FROM zzz
ORDER BY person_id, contact_day
;

There may be a shorter solution using window functions.

Result:


DROP SCHEMA
CREATE SCHEMA
SET
CREATE TABLE
INSERT 0 14
person_id | contact_day | days_last_contact | dash_group 
-----------+-------------+-------------------+------------
1 | 2015-02-09  |                   |          1
1 | 2015-05-01  |                81 |          2
1 | 2015-05-02  |                 1 |          2
1 | 2015-05-03  |                 1 |          2
1 | 2015-06-01  |                29 |          3
1 | 2015-08-01  |                61 |          4
1 | 2015-08-04  |                 3 |          4
1 | 2015-09-01  |                28 |          5
2 | 2015-05-01  |                   |          1
2 | 2015-06-01  |                31 |          2
2 | 2015-07-01  |                30 |          3
3 | 2015-05-01  |                   |          1
3 | 2015-05-02  |                 1 |          1
3 | 2015-05-04  |                 2 |          1
(14 rows)

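If I understand correctly, we can try using a condition inside a SUM window function.

If we create a suitable index on the mydata table (on the person_id and contact_day columns), we may get better performance.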
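
As a rough illustration of such an index, the sketch below uses a made-up index name; the column order is chosen to match PARTITION BY person_id ORDER BY contact_day in the queries. Whether the planner actually uses it depends on table size and statistics, so it is worth checking with EXPLAIN.

```sql
-- Hypothetical index name; columns ordered to match the window definition.
CREATE INDEX mydata_person_id_contact_day_idx
    ON mydata (person_id, contact_day);
```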

Query #1

So the query might look like this:

SELECT
    person_id,
    contact_day,
    days_last_contact,
    SUM(CASE WHEN days_last_contact <= 3 THEN 0 ELSE 1 END)
        OVER (PARTITION BY person_id ORDER BY contact_day)
FROM mydata
ORDER BY person_id, contact_day
;

If days_last_contact needs to be calculated, we can try computing it in a subquery.

SELECT
    person_id,
    contact_day,
    days_last_contact,
    SUM(CASE WHEN days_last_contact <= 3 THEN 0 ELSE 1 END)
        OVER (PARTITION BY person_id ORDER BY contact_day)
FROM (
    SELECT person_id,
           contact_day,
           contact_day - lag(contact_day)
               OVER (PARTITION BY person_id ORDER BY contact_day ASC)
               AS days_last_contact
    FROM mydata
) t1
ORDER BY person_id, contact_day
;

 person_id | contact_day              | days_last_contact | sum
-----------+--------------------------+-------------------+-----
         1 | 2015-02-09T00:00:00.000Z |                   |   1
         1 | 2015-05-01T00:00:00.000Z |                81 |   2
         1 | 2015-05-02T00:00:00.000Z |                 1 |   2
         1 | 2015-05-03T00:00:00.000Z |                 1 |   2
         1 | 2015-06-01T00:00:00.000Z |                29 |   3
         1 | 2015-08-01T00:00:00.000Z |                61 |   4
         1 | 2015-08-04T00:00:00.000Z |                 3 |   4
         1 | 2015-09-01T00:00:00.000Z |                28 |   5
         2 | 2015-05-01T00:00:00.000Z |                   |   1
         2 | 2015-06-01T00:00:00.000Z |                31 |   2
         2 | 2015-07-01T00:00:00.000Z |                30 |   3
         3 | 2015-05-01T00:00:00.000Z |                   |   1
         3 | 2015-05-02T00:00:00.000Z |                 1 |   1
         3 | 2015-05-04T00:00:00.000Z |                 2 |   1
         3 | 2015-06-01T00:00:00.000Z |                28 |   2
         3 | 2015-06-02T00:00:00.000Z |                 1 |   2
         3 | 2015-06-06T00:00:00.000Z |                 4 |   3
