sql:how-to组/分区交错表与null



假设我有两个表,一个描述带有某种时间戳的事件,另一个描述周期(由它们的开始时间给定),如下所示:

create table periods (name varchar(16), start int);
insert into periods values
('day1', 24), ('day2', 35), ('day3', 49);
create table events (id varchar(16), ts int);
insert into events values
('pak', 27), ('mak', 41), ('kew', 30),
('sippa', 58), ('fucha', 50), ('den', 28);

我希望最终将事件与句点相匹配,但使用"交错"方法而不是联接。然后我想象"交错"集会是这样的:

select * from (
(select name as per, s as x, 1 as pri, null as id from periods) 
union all
(select null as per, t as x, 2 as pri, id from events)
) v order by x, pri;
+------+------+-----+-------+
| per  | x    | pri | id    |
+------+------+-----+-------+
| day1 |   24 |   1 | NULL  |
| NULL |   27 |   2 | pak   |
| NULL |   28 |   2 | den   |
| NULL |   30 |   2 | kew   |
| day2 |   35 |   1 | NULL  |
| NULL |   41 |   2 | mak   |
| day3 |   49 |   1 | NULL  |
| NULL |   50 |   2 | fucha |
| NULL |   58 |   2 | sippa |
+------+------+-----+-------+

现在我想在这里使用select ... over ... partition by,这样per字段为非空的行和该字段为空的后续行将落入同一分区,在该分区上我可以使用min(x)(或句点名称)。

但是我很困惑怎么写这个partition by。通过什么?

提前感谢!

p.S.关于特定的SQL方言,这是关于hive的——这就是为什么试图避免联接——但我希望能够将提示与其他数据库引擎的语法相适应


已解决-请参阅下面接受的答案。我们需要额外的嵌套查询,其中我们将COUNT应用于per列。这将返回到目前为止(即高于当前值)字段具有非null值的行数。所以我们得到了一个列,每个组都有单独的编号。外部查询则只需要按此列进行分区。

许多数据库支持ISO/ANSI标准IGNORE NULLLAG()的参数。这只允许:

select v.*,
lag(per ignore nulls) over (order by x) as new_per
from ((select name as per, s as x, 1 as pri, null as id from periods 
) union all
(select null as per, t as x, 2 as pri, id from events
)
) v
order by x, pri;

Hive不支持这个选项,但你可以用两个级别的窗口功能做同样的事情:

select v.*,
max(per) over (partition by grp) as new_per
from (select v.*,
count(per) over (order by x) as grp
from ((select name as per, s as x, 1 as pri, null as id from periods 
) union all
(select null as per, t as x, 2 as pri, id from events
)
) v
) v
order by x, pri;

最新更新