Given a table:
person_id | contact_day | days_last_contact | dash_group
---|---|---|---
1 | 2015-02-09 | | 1
1 | 2015-05-01 | 81 | 2
1 | 2015-05-02 | 1 | 2
1 | 2015-05-03 | 1 | 2
1 | 2015-06-01 | 29 | 3
1 | 2015-08-01 | 61 | 4
1 | 2015-08-04 | 3 | 4
1 | 2015-09-01 | 28 | 5
2 | 2015-05-01 | | 1
2 | 2015-06-01 | 31 | 2
2 | 2015-07-01 | 30 | 3
3 | 2015-05-01 | | 1
3 | 2015-05-02 | 1 | 1
3 | 2015-05-04 | 2 | 1
3 | 2015-06-01 | 28 | 2
3 | 2015-06-02 | 1 | 2
3 | 2015-06-06 | 4 | 3
Count the gaps (greater than the given tolerance) in a second window function to form the group numbers you are after:
SELECT person_id, contact_day
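-- Running count of gaps > 3 days. The first row of each person has a NULL gap,
-- so numbering starts at 0; add 1 to match dash_group in the question.
, count(*) FILTER (WHERE gap > 3) OVER (PARTITION BY person_id ORDER BY contact_day) AS dash_group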
FROM (
SELECT person_id, contact_day
, contact_day - lag(contact_day) OVER (PARTITION BY person_id ORDER BY contact_day) AS gap
FROM mydata
) sub
ORDER BY person_id, contact_day; -- optional
db<>fiddle here
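To try the query above locally, here is a minimal, self-contained sketch. The column types are an assumption (the question does not state them): `contact_day` is taken to be of type `date`, so that subtracting two values yields an integer number of days. The sample rows are the ones from the question, and `1 +` is added so that numbering starts at 1 as in the `dash_group` column there.

```sql
-- Assumed schema: column types are not given in the question.
CREATE TABLE mydata (
    person_id   int  NOT NULL,
    contact_day date NOT NULL
);

INSERT INTO mydata (person_id, contact_day) VALUES
  (1, '2015-02-09'), (1, '2015-05-01'), (1, '2015-05-02'), (1, '2015-05-03')
, (1, '2015-06-01'), (1, '2015-08-01'), (1, '2015-08-04'), (1, '2015-09-01')
, (2, '2015-05-01'), (2, '2015-06-01'), (2, '2015-07-01')
, (3, '2015-05-01'), (3, '2015-05-02'), (3, '2015-05-04')
, (3, '2015-06-01'), (3, '2015-06-02'), (3, '2015-06-06');

-- Inner query: days since the previous contact of the same person (NULL for the first).
-- Outer query: running count of gaps larger than 3 days, shifted to start at 1.
SELECT person_id, contact_day
     , 1 + count(*) FILTER (WHERE gap > 3)
               OVER (PARTITION BY person_id ORDER BY contact_day) AS dash_group
FROM  (
   SELECT person_id, contact_day
        , contact_day - lag(contact_day) OVER (PARTITION BY person_id ORDER BY contact_day) AS gap
   FROM   mydata
   ) sub
ORDER  BY person_id, contact_day;
```

With this data, the six rows of person 3 get the group numbers 1, 1, 1, 2, 2, 3, matching the question.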
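About the aggregate FILTER clause:
- Aggregate columns with additional (distinct) filters
It's short and intuitive, and typically fastest. See:
- For absolute performance, is SUM faster or COUNT?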
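For comparison, a small sketch of the aggregate `FILTER` clause next to the older `CASE`-based spellings that count the same rows. The inline `VALUES` list and the column aliases are made up for illustration; `FILTER` requires PostgreSQL 9.4 or later.

```sql
-- All three expressions count the rows where gap > 3.
-- A NULL gap never satisfies the condition, so it is not counted.
SELECT count(*) FILTER (WHERE gap > 3)            AS with_filter
     , count(CASE WHEN gap > 3 THEN 1 END)        AS with_count_case
     , sum(CASE WHEN gap > 3 THEN 1 ELSE 0 END)   AS with_sum_case
FROM  (VALUES (NULL::int), (81), (1), (1), (29)) AS t(gap);
```

Each of the three columns returns 2 for this input (the gaps 81 and 29).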
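This is a classic "gaps and islands" problem. Once you know to look for the tag gaps-and-islands, you'll find plenty of related or nearly identical questions and answers, like:
- Select longest continuous sequence
- How to group timestamps into islands (based on arbitrary gap)?
- How to label groups in postgresql when group belonging depends on the preceding line?
Etc.
I have tagged the question accordingly now.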
Using a recursive query:
WITH RECURSIVE zzz AS (
SELECT person_id
, contact_day
, md.days_last_contact
, row_number() OVER(PARTITION BY person_id ORDER BY contact_day)
AS dash_group
FROM mydata md
WHERE NOT EXISTS ( -- only the group *leaders*
SELECT * FROM mydata nx
WHERE nx.person_id = md.person_id
AND nx.contact_day < md.contact_day
AND nx.contact_day >= md.contact_day -3
)
UNION ALL
SELECT md.person_id
, md.contact_day
, md.days_last_contact
, zzz.dash_group
FROM zzz
JOIN mydata md ON md.person_id = zzz.person_id
AND md.contact_day > zzz.contact_day
AND md.contact_day <= zzz.contact_day +3
AND NOT EXISTS ( SELECT * -- eliminate the middle men ...
FROM mydata nx
WHERE nx.person_id = md.person_id
AND nx.contact_day > zzz.contact_day
AND nx.contact_day < md.contact_day
)
)
SELECT * FROM zzz
ORDER BY person_id,contact_day
;
There may be a shorter solution using window functions.
Result:
DROP SCHEMA
CREATE SCHEMA
SET
CREATE TABLE
INSERT 0 14
person_id | contact_day | days_last_contact | dash_group
-----------+-------------+-------------------+------------
1 | 2015-02-09 | | 1
1 | 2015-05-01 | 81 | 2
1 | 2015-05-02 | 1 | 2
1 | 2015-05-03 | 1 | 2
1 | 2015-06-01 | 29 | 3
1 | 2015-08-01 | 61 | 4
1 | 2015-08-04 | 3 | 4
1 | 2015-09-01 | 28 | 5
2 | 2015-05-01 | | 1
2 | 2015-06-01 | 31 | 2
2 | 2015-07-01 | 30 | 3
3 | 2015-05-01 | | 1
3 | 2015-05-02 | 1 | 1
3 | 2015-05-04 | 2 | 1
(14 rows)
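If I understand correctly, we can try using a conditional expression inside the SUM window function.
If we create a suitable index on the mydata table (on the person_id and contact_day columns), we may get better performance.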
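Such an index could be created as sketched below. The index name is hypothetical, and whether the planner actually uses the index depends on the table size and the query, so it is worth checking with `EXPLAIN`.

```sql
-- Multicolumn index matching PARTITION BY person_id ORDER BY contact_day.
-- The index name is an illustrative choice, not taken from the question.
CREATE INDEX mydata_person_id_contact_day_idx
    ON mydata (person_id, contact_day);
```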
Query #1
So the query might look like this:
SELECT
person_id,
contact_day,
days_last_contact,
SUM(CASE WHEN days_last_contact <= 3 THEN 0 ELSE 1 END) OVER(PARTITION BY person_id ORDER BY contact_day)
FROM mydata
ORDER BY person_id, contact_day
;
If days_last_contact needs to be computed, we can compute it in a subquery:
SELECT
person_id,
contact_day,
days_last_contact,
SUM(CASE WHEN days_last_contact <= 3 THEN 0 ELSE 1 END) OVER(PARTITION BY person_id ORDER BY contact_day)
FROM (
SELECT person_id,
contact_day,
contact_day - lag(contact_day)
OVER (PARTITION BY person_id ORDER BY contact_day ASC)
AS days_last_contact
FROM mydata
) t1
ORDER BY person_id, contact_day
;
person_id | contact_day | days_last_contact | sum
---|---|---|---
1 | 2015-02-09T00:00:00.000Z | | 1
1 | 2015-05-01T00:00:00.000Z | 81 | 2
1 | 2015-05-02T00:00:00.000Z | 1 | 2
1 | 2015-05-03T00:00:00.000Z | 1 | 2
1 | 2015-06-01T00:00:00.000Z | 29 | 3
1 | 2015-08-01T00:00:00.000Z | 61 | 4
1 | 2015-08-04T00:00:00.000Z | 3 | 4
1 | 2015-09-01T00:00:00.000Z | 28 | 5
2 | 2015-05-01T00:00:00.000Z | | 1
2 | 2015-06-01T00:00:00.000Z | 31 | 2
2 | 2015-07-01T00:00:00.000Z | 30 | 3
3 | 2015-05-01T00:00:00.000Z | | 1
3 | 2015-05-02T00:00:00.000Z | 1 | 1
3 | 2015-05-04T00:00:00.000Z | 2 | 1
3 | 2015-06-01T00:00:00.000Z | 28 | 2
3 | 2015-06-02T00:00:00.000Z | 1 | 2
3 | 2015-06-06T00:00:00.000Z | 4 | 3