"ROWS BETWEEN"和"RANGE BETWEEN"在(Presto)窗口中的区别 函数"OVER"子句



这个问题主要是关于旧版本的PrestoSQL,这些问题已在Trino项目中解决346.然而,亚马逊的雅典娜项目是基于Presto版本0.217(雅典娜引擎2(和0.172(雅典娜引擎1(,这确实存在以下问题。这个问题是专门围绕Athena Engine 1/PrestoSQL版本0.172写的

问题(太长了,读不下去了(

  1. Presto窗口函数中的ROWS BETWEENRANGE BETWEEN之间有什么区别?
    • 这些只是彼此的同义词,还是存在核心概念差异
    • 如果它们只是同义词,为什么ROWS BETWEEN允许比RANGE BETWEEN更多的选项
  2. 是否存在可以在ROWS BETWEENRANGE BETWEEN上使用完全相同的参数并获得不同结果的查询场景?
    • 如果只使用unbounded/current row,是否存在使用RANGE而不是ROWS的情况(反之亦然(
  3. 既然ROWS有更多的选项,为什么文档中根本没有提到它?o_o

注释

presto文档甚至对RANGE都相当安静,没有提到ROWS。我在Presto中没有发现很多关于窗口函数的讨论或例子。我开始浏览Presto代码库,试图弄清楚这一点。希望有人能救我一命,我们可以一起改进文档。

Presto代码有一个ROWS变体的解析器和测试用例,但文档中没有提到ROWS

我在ROWSRANGE中发现的测试用例并没有测试这两种语法之间的任何不同。

它们看起来几乎像同义词,但在我的测试中表现不同,并且有不同的允许参数和验证规则。

以下示例可以使用运行presto 0.213-e-0.1的starburstdata/presso-Docker映像运行。通常,我通过Amazon Athena运行Presto 0.172,并且几乎总是使用ROWS

范围

RANGE似乎被限制为";UNBOUNDED";以及";当前行";。以下返回错误:

range between 1 preceding and 1 following

use tpch.tiny;
select custkey, orderdate,
array_agg(orderdate) over ( 
partition by custkey 
order by orderdate asc 
range between 1 preceding and 1 following
) previous_orders 
from orders where custkey in (419, 320) and orderdate < date('1996-01-01')
order by custkey, orderdate asc;

错误:Window frame RANGE PRECEDING is only supported with UNBOUNDED

以下范围语法确实可以正常工作(预期会产生不同的结果(以下所有示例都基于上述查询,只是更改范围

range between unbounded preceding and current row

custkey | orderdate  |                             previous_orders
---------+------------+--------------------------------------------------------------------------
320 | 1992-07-10 | [1992-07-10]
320 | 1992-07-30 | [1992-07-10, 1992-07-30]
320 | 1994-07-08 | [1992-07-10, 1992-07-30, 1994-07-08]
320 | 1994-08-04 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04]
320 | 1994-09-18 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18]
320 | 1994-10-12 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
419 | 1992-03-16 | [1992-03-16]
419 | 1993-12-29 | [1992-03-16, 1993-12-29]
419 | 1995-01-30 | [1992-03-16, 1993-12-29, 1995-01-30]

range between current row and unbounded following

custkey | orderdate  |                             previous_orders
---------+------------+--------------------------------------------------------------------------
320 | 1992-07-10 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
320 | 1992-07-30 | [1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
320 | 1994-07-08 | [1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
320 | 1994-08-04 | [1994-08-04, 1994-09-18, 1994-10-12]
320 | 1994-09-18 | [1994-09-18, 1994-10-12]
320 | 1994-10-12 | [1994-10-12]
419 | 1992-03-16 | [1992-03-16, 1993-12-29, 1995-01-30]
419 | 1993-12-29 | [1993-12-29, 1995-01-30]
419 | 1995-01-30 | [1995-01-30]

无界前序和无界后序之间的范围

custkey | orderdate  |                             previous_orders
---------+------------+--------------------------------------------------------------------------
320 | 1992-07-10 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
320 | 1992-07-30 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
320 | 1994-07-08 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
320 | 1994-08-04 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
320 | 1994-09-18 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
320 | 1994-10-12 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
419 | 1992-03-16 | [1992-03-16, 1993-12-29, 1995-01-30]
419 | 1993-12-29 | [1992-03-16, 1993-12-29, 1995-01-30]
419 | 1995-01-30 | [1992-03-16, 1993-12-29, 1995-01-30]

上述RANGE的三个工作示例都适用于ROWS,并产生相同的输出。

rows between unbounded preceding and current row
rows between current row and unbounded following
rows between unbounded preceding and unbounded following

省略输出-与上述相同

然而,ROWS允许更多的控制,因为您也可以使用range:执行上面失败的语法

rows between 1 preceding and 1 following

custkey | orderdate  |           previous_orders
---------+------------+--------------------------------------
320 | 1992-07-10 | [1992-07-10, 1992-07-30]
320 | 1992-07-30 | [1992-07-10, 1992-07-30, 1994-07-08]
320 | 1994-07-08 | [1992-07-30, 1994-07-08, 1994-08-04]
320 | 1994-08-04 | [1994-07-08, 1994-08-04, 1994-09-18]
320 | 1994-09-18 | [1994-08-04, 1994-09-18, 1994-10-12]
320 | 1994-10-12 | [1994-09-18, 1994-10-12]
419 | 1992-03-16 | [1992-03-16, 1993-12-29]
419 | 1993-12-29 | [1992-03-16, 1993-12-29, 1995-01-30]
419 | 1995-01-30 | [1993-12-29, 1995-01-30]

rows between current row and 1 following

custkey | orderdate  |     previous_orders
---------+------------+--------------------------
320 | 1992-07-10 | [1992-07-10, 1992-07-30]
320 | 1992-07-30 | [1992-07-30, 1994-07-08]
320 | 1994-07-08 | [1994-07-08, 1994-08-04]
320 | 1994-08-04 | [1994-08-04, 1994-09-18]
320 | 1994-09-18 | [1994-09-18, 1994-10-12]
320 | 1994-10-12 | [1994-10-12]
419 | 1992-03-16 | [1992-03-16, 1993-12-29]
419 | 1993-12-29 | [1993-12-29, 1995-01-30]
419 | 1995-01-30 | [1995-01-30]

rows between 5 preceding and 2 preceding

custkey | orderdate  |                 previous_orders
---------+------------+--------------------------------------------------
320 | 1992-07-10 | NULL
320 | 1992-07-30 | NULL
320 | 1994-07-08 | [1992-07-10]
320 | 1994-08-04 | [1992-07-10, 1992-07-30]
320 | 1994-09-18 | [1992-07-10, 1992-07-30, 1994-07-08]
320 | 1994-10-12 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04]
419 | 1992-03-16 | NULL
419 | 1993-12-29 | NULL
419 | 1995-01-30 | [1992-03-16]
  • ROWS实际上是要聚合的前后行数。因此,无论orderdate的值如何,ORDER BY day ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING最终都将有3行:curnet行前1行,后1行
  • RANGE将查看orderdate的值,并决定哪些应该聚合,哪些不应该聚合。因此,理论上ORDER BY day RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING将采用值为orderdate-1、orderdate和orderdate+1的所有行——这可能超过3行(请参阅此处的更多解释(

在Presto中,ROWS是完全实现的,但RANGE在某种程度上只是部分实现的,并且您只能在中与CURRENT ROWUNBOUNDED一起使用。

注意:最近版本的Trino(以前称为Presto SQL(已满支持CCD_ 38和CCD_。查看的此博客文章对它们如何工作的解释。

在Presto中,要想看到两者之间的差异,最好的方法是确保order子句的值相同:

WITH
tt1  (custkey, orderdate, product) AS 
( SELECT * FROM ( VALUES ('a','1992-07-10', 3), ('a','1993-08-10', 4), ('a','1994-07-13', 5), ('a','1995-09-13', 5), ('a','1995-09-13', 9), ('a','1997-01-13', 4),
('b','1992-07-10', 6), ('b','1992-07-10', 4), ('b','1994-07-13', 5), ('b','1994-07-13', 9), ('b','1998-11-11', 9) )  )
SELECT *, 
array_agg(product) OVER (partition by custkey) c, 
array_agg(product) OVER (partition by custkey order by orderdate) c_order,

array_agg(product) OVER (partition by custkey order by orderdate RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) range_ubub,
array_agg(product) OVER (partition by custkey order by orderdate ROWS  BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) rows_ubub,

array_agg(product) OVER (partition by custkey order by orderdate RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) range_ubc,
array_agg(product) OVER (partition by custkey order by orderdate ROWS  BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) rows_ubc,

array_agg(product) OVER (partition by custkey order by orderdate RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) range_cub,
array_agg(product) OVER (partition by custkey order by orderdate ROWS  BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) rows_cub,

-- array_agg(product) OVER (partition by custkey order by orderdate RANGE BETWEEN 2 PRECEDING AND 2 FOLLOWING)  range22,
-- SYNTAX_ERROR: line 19:65: Window frame RANGE PRECEDING is only supported with UNBOUNDED
array_agg(product) OVER (partition by custkey order by orderdate ROWS  BETWEEN 2 PRECEDING AND 2 FOLLOWING)  rows22
from tt1
order by custkey, orderdate, product

你可以跑步,看到完整的结果,并从中学习。。

我只在这里放一些有趣的专栏:

custkey   orderdate     product    range_ubc           rows_ubc
a         10/07/1992    3          [3]                 [3]
a         10/08/1993    4          [3, 4]              [3, 4]
a         13/07/1994    5          [3, 4, 5]           [3, 4, 5]
a         13/09/1995    5          [3, 4, 5, 5, 9]     [3, 4, 5, 5]
a         13/09/1995    9          [3, 4, 5, 5, 9]     [3, 4, 5, 5, 9]
a         13/01/1997    4          [3, 4, 5, 5, 9, 4]  [3, 4, 5, 5, 9, 4]
b         10/07/1992    4          [6, 4]              [6, 4]
b         10/07/1992    6          [6, 4]              [6]
b         13/07/1994    5          [6, 4, 5, 9]        [6, 4, 5]
b         13/07/1994    9          [6, 4, 5, 9]        [6, 4, 5, 9]
b         11/11/1998    9          [6, 4, 5, 9, 9]     [6, 4, 5, 9, 9]

如果您查看:orderdate:13/09/1995, product:5的第5行(注意:对于custkey:a13/09/1995出现两次(,您可以看到ROWS确实占用了从顶部到当前行的所有行。但如果您查看RANGE,您会发现它还包括之后的行中的值,因为它具有完全相同的orderdate,因此在同一窗口中被视为

最新更新