相当于Impala中的FIRST()和LAST()



来自不同传感器单元的数据来自Hadoop中的unix_time流。在SQL中使用first和last语句获取每天的第一个和最后一个值非常容易,如下所示。这里我得到了一天开始和结束时的温度。

SELECT unit, FIRST(Temp) as Start_Day_Value, LAST(Temp) as End_Day_Value
FROM Sensor_Data.Table
WHERE cast(ts/1000 as TIMESTAMP) BETWEEN "2021-01-05 00:00:00" AND "2021-01-05 23:59:59"
GROUP BY unit
ORDER BY unit;

在IMPALA中是否有任何等效的命令,以便我可以像在SQL中一样获得每列的第一个和最后一个值。

标准SQL有一个FIRST_VALUE()窗口函数,但是没有一个等价的聚合函数。因此,在Impala或任何其他数据库中执行所需操作的一种方法是:

SELECT DISTINCT unit,
FIRST_VALUE(Temp) OVER (PARTITION BY unit ORDER BY ts) as Start_Day_Value, 
FIRST_VALUE(Temp) OVER (PARTITION BY unit ORDER BY ts DESC) as End_Day_Value
FROM Sensor_Data.Table
WHERE cast(ts/1000 as TIMESTAMP) >= '2021-01-05' AND 
cast(ts/1000 as TIMESTAMP) < '2021-01-06'
ORDER BY unit;

注意,我还简化了日期比较。

注意:我不确定cast()工作(我希望代码使用from_timestamp())。但这不是你要问的。

编辑:

在Impala中,可以使用GROUP BY:

SELECT u.unit,
MAX(CASE WHEN seqnum_asc = 1 THEN temp END) as first_temp,
MAX(CASE WHEN seqnum_asc = 1 THEN temp END) as lasst_temp
FROM (SELECT unit,
ROW_NUMBER() OVER (PARTITION BY unit ORDER BY ts) as seqnum_asc, 
ROW_NUMBER() OVER (PARTITION BY unit ORDER BY ts desc) as seqnum_desc _Day_Value
FROM Sensor_Data.Table
WHERE cast(ts/1000 as TIMESTAMP) >= '2021-01-05' AND 
cast(ts/1000 as TIMESTAMP) < '2021-01-06'
) u
GROUP BY unit
ORDER BY unit;

我用下面的方法解决了这个问题:

SELECT DISTINCT u.unit, u.initial_temp, u.final_temp, u.initial_p, u.final_p
FROM(SELECT unit,
first_value(temperature) OVER(PARTITION BY unit ORDER BY ts DESC) as final_temp,
first_value(temperature) OVER(PARTITION BY unit ORDER BY ts ASC) as initial_temp,
first_value(pressure) OVER(PARTITION BY unit ORDER BY ts DESC) as final_p,
first_value(pressure) OVER(PARTITION BY unit ORDER BY ts ASC) as initial_p
FROM Sensor_Data.Table
WHERE cast(ts/1000 as TIMESTAMP) BETWEEN "2020-12-11 00:00:00" AND "2020-12-11 23:59:59"
) AS u
ORDER BY unit;