如何在Athena中使用日期分区创建VIEW



我想创建一个VIEW,它总是从最近的两个Athena分区返回最后1小时的数据。

我正在使用下面的AmazonAthena DDL和一个varchar类型的名为"datehour"的分区列。

CREATE EXTERNAL TABLE mydb.table_foo (
`account_id` string,
`account_email_address` string,
`record_timestamp` timestamp)
PARTITIONED BY ( 
`datehour` string)
ROW FORMAT SERDE 
'org.openx.data.jsonserde.JsonSerDe' 
STORED AS INPUTFORMAT 
'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://data-here-9191919191/table_foo'
TBLPROPERTIES (
'projection.datehour.format'='yyyy/MM/dd/HH', 
'projection.datehour.interval'='1', 
'projection.datehour.interval.unit'='HOURS', 
'projection.datehour.range'='2020/11/01/00,NOW', 
'projection.datehour.type'='date', 
'projection.enabled'='true', 
'storage.location.template'='s3://data-here-9191919191/table_foo/${datehour}', 
'transient_lastDdlTime'='1604013447')

以下是我想要运行的查询,以创建一个VIEW,它总是返回最近2个分区中最后1小时的数据。

select *
from mydb.table_foo
where
(datehour = CONCAT( 
CAST( year(current_timestamp) AS varchar) , '/',
CAST( month(current_timestamp) AS varchar), '/',
CAST( day(current_timestamp) AS varchar), '/',
CAST( hour(current_timestamp) AS varchar)) 
OR
datehour = CONCAT( 
CAST( year(current_timestamp) AS varchar) , '/',
CAST( month(current_timestamp) AS varchar), '/',
CAST( day(current_timestamp) AS varchar), '/',
CAST( ( hour(current_timestamp - interval '1' hour) ) AS varchar)))
AND
record_timestamp BETWEEN  (current_timestamp - interval '1' hour)  AND current_timestamp

我想让SQL动态化的示例是这样的:

select *
from table_foo
where datehour = '2020/11/17/23' or datehour = '2020/11/18/00' AND
record_timestamp BETWEEN  (current_timestamp - interval '1' hour) AND current_timestamp

动态WHERE逻辑会围绕所有字段(月、日和年(的更改带来问题。我在第一天,也就是今天午夜左右遇到了这个问题。如何动态生成datehour分区?

解决方案应该是以正确的格式将current_timestamp - interval '1' hour强制转换为varchar。假设数据被划分为yyyy/MM/dd/HH,如表定义中所述:

select *
from mydb.table_foo
WHERE datehour BETWEEN date_format(date_trunc('hour',current_timestamp - interval '1' hour),'%Y/%m/%d/%H') AND date_format(date_trunc('hour',current_timestamp),'%Y/%m/%d/%H')
AND record_timestamp BETWEEN  (current_timestamp - interval '1' hour)  AND current_timestamp

最新更新