我有一个10亿行的数据集,它不断地用更多关于客户的重复数据构建。
ID creation_date report_date status
001 2021-01-20T00:22:06Z 2021-02-02T00:22:06Z ACTIVE
002 2021-01-30T00:22:06Z 2021-02-02T00:22:06Z ACTIVE
003 2021-02-01T00:22:06Z 2021-02-02T00:22:06Z ACTIVE
001 2021-01-20T00:22:06Z 2021-02-02T00:23:06Z ACTIVE
002 2021-01-30T00:22:06Z 2021-02-02T00:23:06Z ACTIVE
003 2021-02-01T00:22:06Z 2021-02-02T00:23:06Z ACTIVE
001 2021-01-20T00:22:06Z 2021-02-19T00:22:06Z ACTIVE
002 2021-01-30T00:22:06Z 2021-02-19T00:22:06Z ACTIVE
003 2021-02-01T00:22:06Z 2021-02-19T00:22:06Z ACTIVE
001 2021-01-20T00:22:06Z 2021-02-20T00:22:06Z ACTIVE
002 2021-01-30T00:22:06Z 2021-02-20T00:22:06Z EXPIRED
003 2021-02-01T00:22:06Z 2021-02-20T00:22:06Z EXPIRED
001 2021-01-20T00:22:06Z 2021-02-21T00:22:06Z ACTIVE
002 2021-01-30T00:22:06Z 2021-02-21T00:22:06Z EXPIRED
003 2021-02-01T00:22:06Z 2021-02-21T00:22:06Z EXPIRED
001 2021-01-20T00:22:06Z 2021-02-30T00:22:06Z ACTIVE
002 2021-01-30T00:22:06Z 2021-02-30T00:22:06Z EXPIRED
003 2021-02-01T00:22:06Z 2021-02-30T00:22:06Z EXPIRED
001 2021-01-20T00:22:06Z 2021-03-01T00:22:06Z ACTIVE
002 2021-01-30T00:22:06Z 2021-03-01T00:22:06Z EXPIRED
003 2021-02-01T00:22:06Z 2021-03-01T00:22:06Z ACTIVE
001 2021-01-20T00:22:06Z 2021-03-22T00:22:06Z EXPIRED
002 2021-01-30T00:22:06Z 2021-03-22T00:22:06Z EXPIRED
003 2021-02-01T00:22:06Z 2021-03-22T00:22:06Z EXPIRED
每个report_date表示将所有记录更新为当前状态的日期。就像脉搏检查一样。
我想要的只是用户在创建日期后一个月的一周内(第5周(的最后状态
例如:ID=001。
在这里,我们看到它们的创建日期是2021-01-20
,意味着从这个日期起一个月是2021-02-20
。我想知道:
- 在
2021-02-20
和2021-02-27
之间的报告日期期间,此用户的最终状态是什么
您可以在上面的数据中看到,Active在2021-02-20
和2021-02-27
(列出的(之间的所有报告中都保持活动状态
简单地说,我们只想知道这个时间段内状态的最后一次变化。注意,在ID=003中,它们在2021-02-22', so though they were
EXPIRED`前一天切换到ACTIVE
,在边界内切换回活动状态。
一个月后一周内的任何事情(5周后的任何事情(都无关紧要。
你也可能会注意到,从2021-01-30
开始的一个月是2021-02-30
,这没有意义。在这些情况下,请使用当月的最终日期或2021-02-28
。
最终输出:
ID week_5_status
001 ACTIVE
002 EXPIRED
003 ACTIVE
首先,将文本值(大概(转换为有效的日期时间值。然后,过滤行,使report_datetime在creation_datetime之后少于6周。取过滤列表的最大值,然后连接回原始数据,以获得具有最大值的行的状态。
CREATE TABLE t (id int, creation_date VARCHAR(19), report_date VARCHAR(19), status text);
INSERT INTO t (id,creation_date,report_date,status) VALUES
(1,'2021-01-20T00:22:06','2021-02-02T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-02T00:22:06','ACTIVE'),
(3,'2021-02-01T00:22:06','2021-02-02T00:22:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-02-02T00:23:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-02T00:23:06','ACTIVE'),
(3,'2021-02-01T00:22:06','2021-02-02T00:23:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-02-19T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-19T00:22:06','ACTIVE'),
(3,'2021-02-01T00:22:06','2021-02-19T00:22:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-02-20T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-20T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-02-20T00:22:06','EXPIRED'),
(1,'2021-01-20T00:22:06','2021-02-21T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-21T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-02-21T00:22:06','EXPIRED'),
(1,'2021-01-20T00:22:06','2021-02-30T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-30T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-02-30T00:22:06','EXPIRED'),
(1,'2021-01-20T00:22:06','2021-03-01T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-03-01T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-03-01T00:22:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-03-22T00:22:06','EXPIRED'),
(2,'2021-01-30T00:22:06','2021-03-22T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-03-22T00:22:06','EXPIRED');
WITH dat
AS
(
SELECT id
, CAST(creation_date AS datetime) AS creation_datetime
, CAST(REPLACE(report_date,'02-30','02-28') AS datetime) AS report_datetime
, status
FROM t
),
dat2
AS
(
SELECT id
,MAX(report_datetime) AS max_report_datetime
FROM dat
WHERE DATEDIFF(week,creation_datetime,report_datetime) < 6
GROUP BY id
)
SELECT dat.*
FROM dat
INNER JOIN dat2
ON dat.id = dat2.id
AND dat.report_datetime = dat2.max_report_datetime;
dbfiddle.uk
QUALIFY和ROW_NUMBER似乎是您想要选择的"窗口内的最后状态";
因此,使用data
的CTE,我将几个无效的report_date
更改为真实日期。也许是错误的方向,但它不会影响SQL。
WITH data(id,creation_date,report_date,status)AS (
SELECT column1
,to_date(column2, 'YYYY-MM-DDThh:mi:ss')
,to_date(column3, 'YYYY-MM-DDThh:mi:ss')
,column4
FROM VALUES
(1,'2021-01-20T00:22:06','2021-02-02T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-02T00:22:06','ACTIVE'),
(3,'2021-02-01T00:22:06','2021-02-02T00:22:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-02-02T00:23:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-02T00:23:06','ACTIVE'),
(3,'2021-02-01T00:22:06','2021-02-02T00:23:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-02-19T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-19T00:22:06','ACTIVE'),
(3,'2021-02-01T00:22:06','2021-02-19T00:22:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-02-20T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-20T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-02-20T00:22:06','EXPIRED'),
(1,'2021-01-20T00:22:06','2021-02-21T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-21T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-02-21T00:22:06','EXPIRED'),
(1,'2021-01-20T00:22:06','2021-02-28T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-28T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-02-28T00:22:06','EXPIRED'),
(1,'2021-01-20T00:22:06','2021-03-01T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-03-01T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-03-01T00:22:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-03-22T00:22:06','EXPIRED'),
(2,'2021-01-30T00:22:06','2021-03-22T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-03-22T00:22:06','EXPIRED')
)
主SQL变为:
SELECT d.id
,d.creation_date
,d.report_date
,d.status
FROM data AS d
WHERE dateadd(week,5,d.creation_date) >= d.report_date
QUALIFY row_number() over (partition by id order by report_date desc) = 1 ;
给出:
ID | 创建日期 | 报告日期 | 状态|
---|---|---|---|
1 | 2021-01-20 | 2021:02-21 | 活动 |
2 | 2021-01-30 | 2021-03-01 | 到期 |
3 | 2021-02-01 | 2022-03-01 | 活动 |