Hive HQL - 优化重复的 WINDOW 子句



我有以下HQL

SELECT count(*) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) pocet,
       min(event.time) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) minTime,
       max(event.time) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) maxTime 
FROM t21_pam6

如何将 3 个相同的 WINDOW 子句定义为一个?

文档 (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics) 显示此示例

SELECT a, SUM(b) OVER w
FROM T;
WINDOW w AS (PARTITION BY c ORDER BY d ROWS UNBOUNDED PRECEDING)

但我不认为它奏效。不可能将窗口 w 定义为...不是 HQL 命令。

这种类型的优化是编译器需要做的事情。 我认为没有办法以编程方式确保这一点。

也就是说,计算最短时间是完全没有必要的。 由于order by,它应该是当前行中的时间。 同样,如果可以处理null值,则可以将表达式简化为:

SELECT count(*) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) pocet,
       event.time as minTime,
       lead(event.time, 2) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time) as maxTime 
FROM t21_pam6;

请注意,maxtime计算略有不同,因为它将返回与条件匹配的最后两个值的NULL

正如@sergey-khudyakov回应的那样,文档中存在一个错误。此变体工作正常:

  SELECT count(*) OVER w,
          min(event.time) OVER w,
          max(event.time) OVER w 
   FROM ar3.t21_pam6
   WINDOW w AS (PARTITION BY identity.hwid, passwordused.domain ORDER BY      event.time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING)

最新更新