假设我有两个表,一个描述带有某种时间戳的事件,另一个描述周期(由它们的开始时间给定),如下所示:
create table periods (name varchar(16), start int);
insert into periods values
('day1', 24), ('day2', 35), ('day3', 49);
create table events (id varchar(16), ts int);
insert into events values
('pak', 27), ('mak', 41), ('kew', 30),
('sippa', 58), ('fucha', 50), ('den', 28);
我希望最终将事件与句点相匹配,但使用"交错"方法而不是联接。然后我想象"交错"集会是这样的:
select * from (
(select name as per, s as x, 1 as pri, null as id from periods)
union all
(select null as per, t as x, 2 as pri, id from events)
) v order by x, pri;
+------+------+-----+-------+
| per | x | pri | id |
+------+------+-----+-------+
| day1 | 24 | 1 | NULL |
| NULL | 27 | 2 | pak |
| NULL | 28 | 2 | den |
| NULL | 30 | 2 | kew |
| day2 | 35 | 1 | NULL |
| NULL | 41 | 2 | mak |
| day3 | 49 | 1 | NULL |
| NULL | 50 | 2 | fucha |
| NULL | 58 | 2 | sippa |
+------+------+-----+-------+
现在我想在这里使用select ... over ... partition by
,这样per
字段为非空的行和该字段为空的后续行将落入同一分区,在该分区上我可以使用min(x)
(或句点名称)。
但是我很困惑怎么写这个partition by
。通过什么?
提前感谢!
p.S.关于特定的SQL方言,这是关于hive
的——这就是为什么试图避免联接——但我希望能够将提示与其他数据库引擎的语法相适应
已解决-请参阅下面接受的答案。我们需要额外的嵌套查询,其中我们将COUNT
应用于per
列。这将返回到目前为止(即高于当前值)字段具有非null值的行数。所以我们得到了一个列,每个组都有单独的编号。外部查询则只需要按此列进行分区。
许多数据库支持ISO/ANSI标准IGNORE NULL
对LAG()
的参数。这只允许:
select v.*,
lag(per ignore nulls) over (order by x) as new_per
from ((select name as per, s as x, 1 as pri, null as id from periods
) union all
(select null as per, t as x, 2 as pri, id from events
)
) v
order by x, pri;
Hive不支持这个选项,但你可以用两个级别的窗口功能做同样的事情:
select v.*,
max(per) over (partition by grp) as new_per
from (select v.*,
count(per) over (order by x) as grp
from ((select name as per, s as x, 1 as pri, null as id from periods
) union all
(select null as per, t as x, 2 as pri, id from events
)
) v
) v
order by x, pri;