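This query fails once a month because the BETWEEN part becomes invalid. With value BETWEEN min AND max,
my query will fail again on March 1, because it will evaluate to partition_2 BETWEEN 28 AND 1.
How can I make this query more robust while still scanning only the partitions it needs?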
WITH recent_tasks AS
(SELECT task_id, state, timestamp, partition_0, partition_1, partition_2,
row_number() OVER (PARTITION BY task_id
ORDER BY timestamp DESC) AS rn
FROM firehose
WHERE
"partition_0" BETWEEN to_char(current_date - interval '1' day, 'yyyy') AND to_char(current_date, 'yyyy')
and "partition_1" BETWEEN to_char(current_date - interval '1' day, 'mm') AND to_char(current_date, 'mm')
and "partition_2" BETWEEN to_char(current_date - interval '1' day, 'dd') AND to_char(current_date, 'dd')
ORDER BY rn)
SELECT * FROM recent_tasks
WHERE rn=1
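A few notes:
- The partitions are char values, not integers
- partition_2 is the day-of-month partition
- The purpose of the query is to find the latest state of each task_id
- Using AWS Athena
- Data is stored in s3/yyyy/mm/dd format, with a new partition every day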
Ideally, my query would handle the month transition correctly:
BETWEEN FEB 10 AND FEB 11 (works with above)
BETWEEN FEB 28 AND MAR 1 (fails with above)
BETWEEN MAR 1 AND MAR 2 (works with above)
If you want to get zero instead of 28:
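cast(to_char(current_date, 'dd') as integer) - 1
So on 03/01, to_char(current_date, 'dd') returns 1, and subtracting one from it gives you zero. Because the partitions are char values, cast the result back to varchar and zero-pad it before comparing:
and "partition_2" BETWEEN lpad(cast(cast(to_char(current_date, 'dd') as integer) - 1 as varchar), 2, '0') AND to_char(current_date, 'dd')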
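Another option, offered as a sketch rather than a tested Athena query: match each day's (year, month, day) triple exactly instead of applying BETWEEN to each partition column separately. On March 1 the per-column ranges no longer describe two consecutive days, and a lower bound of zero on the day column still leaves out the February 28 partition. The version below reuses only the to_char calls and the table and column names from the question. It assumes that Athena still prunes partitions when the equality predicates are combined with OR, which is worth confirming against the data-scanned figure the query reports.

```sql
-- Sketch: read exactly yesterday's and today's partitions.
-- Names (firehose, partition_0..2, task_id, state, timestamp) come from the question.
WITH recent_tasks AS (
    SELECT task_id, state, timestamp, partition_0, partition_1, partition_2,
           row_number() OVER (PARTITION BY task_id
                              ORDER BY timestamp DESC) AS rn
    FROM firehose
    WHERE
        -- yesterday's partition: year, month and day all taken from the same date
        (    "partition_0" = to_char(current_date - interval '1' day, 'yyyy')
         AND "partition_1" = to_char(current_date - interval '1' day, 'mm')
         AND "partition_2" = to_char(current_date - interval '1' day, 'dd'))
        OR
        -- today's partition
        (    "partition_0" = to_char(current_date, 'yyyy')
         AND "partition_1" = to_char(current_date, 'mm')
         AND "partition_2" = to_char(current_date, 'dd'))
)
SELECT * FROM recent_tasks
WHERE rn = 1
```

Since each branch derives all three values from a single date, March 1 selects the last February partition plus the March 1 partition, and the same reasoning covers the December 31 to January 1 transition.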