Counting transactions up to given points in a day in SQL (Hive)

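I have two tables that I am joining together, and I want to count the number of transactions that happened before certain points in time. For example, given the two tables below: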

CREATE TABLE table1 (ID INT, time TIME);
INSERT INTO table1 
VALUES
(1, '11:30:00'),
(1, '14:30:00'),
(1, '18:00:00')
;
CREATE TABLE table2 (ID INT, txn_time TIME, txn_val INT);
INSERT INTO table2
VALUES
(1, '10:45:13', 1),
(1, '10:50:52', 2),
(1, '11:01:20', 4),
(1, '14:32:12', 2),
(1, '16:43:20', 5),
(1, '19:22:02', 3)
;

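For each time in table1, I want the cumulative number of transactions in table2 that occurred before it. So for the data above, I need the following table: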

┌─────────────┬──────────────┬──────────────┐
│ ID          │        time  │    txn count │
├─────────────┼──────────────┼──────────────┤
│ 1           │     11:30:00 │            3 │
│ 1           │     14:30:00 │            3 │
│ 1           │     18:00:00 │            5 │
└─────────────┴──────────────┴──────────────┘

My current code is:

SELECT t1.ID,
t1.time,
sum(CASE WHEN t2.txn_time < t1.time THEN 1 END) over(PARTITION BY t1.time)
FROM table1 AS t1
LEFT JOIN table2 AS t2 on t1.ID = t2.ID
GROUP BY t1.ID, t1.time
ORDER BY t1.time

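But this gives the wrong numbers, so I am missing something. Do you know how to do this with partitioning, or is there a simpler way? Ultimately, I will want to do this over several days, with the count resetting at the start of each day. The query also does a many-to-many join, which gets increasingly expensive on larger datasets, but I don't know how to prevent that.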

You can join and aggregate:

select t1.id, t1.time, count(t2.id) as txn_count
from table1 t1
left join table2 t2 
on  t2.id = t1.id 
and t2.txn_time <= t1.time
group by t1.id, t1.time
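
For the multi-day case mentioned in the question, the same join can be restricted to transactions from the same calendar day, which makes the count restart each day. This is only a sketch: `ts` and `txn_ts` are hypothetical TIMESTAMP columns standing in for the TIME columns above, and it assumes your Hive version accepts non-equality conditions in the ON clause.

```sql
-- Assumption: table1.ts and table2.txn_ts are hypothetical TIMESTAMP columns
-- replacing the TIME columns used in the question.
select t1.id, t1.ts, count(t2.id) as txn_count
from table1 t1
left join table2 t2
  on  t2.id = t1.id
  -- only transactions from the same calendar day are counted,
  -- so the total starts again from zero each day
  and to_date(t2.txn_ts) = to_date(t1.ts)
  and t2.txn_ts <= t1.ts
group by t1.id, t1.ts
```

The extra equality on the date also narrows the join, so each checkpoint is only matched against that day's transactions.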

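The join-and-aggregate query is not optimal, because the same rows of table2 end up being joined several times as time in table1 increases. It may be more efficient to pre-compute the ranges in table1, so that each transaction matches at most one row, and then use a window function to accumulate the counts: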

select t1.id, t1.time, 
sum(count(t2.id)) over(partition by t1.id order by t1.time) as txn_count 
from (
select t1.*, lag(t1.time) over(partition by id order by time) lag_time
from table1 t1
) t1
left join table2 t2 
on  t2.id = t1.id
and t2.txn_time <= t1.time 
and (t2.txn_time > t1.lag_time or t1.lag_time is null)
group by t1.id, t1.time
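
With the sample data, the three intervals (up to 11:30:00, then up to 14:30:00, then up to 18:00:00) contain 3, 0 and 2 transactions, and the running sum turns these into 3, 3 and 5. The same idea might be extended to several days roughly as follows. As a sketch under assumptions: `ts` and `txn_ts` are hypothetical TIMESTAMP columns, `ts_day` and `lag_ts` are illustrative aliases, and the Hive version is assumed to allow non-equality conditions in the ON clause.

```sql
-- Assumption: table1.ts and table2.txn_ts are hypothetical TIMESTAMP columns
-- standing in for the TIME columns used in the question.
select t1.id, t1.ts,
       -- running total of the per-interval counts, restarted for each id and day
       sum(count(t2.id)) over(partition by t1.id, t1.ts_day order by t1.ts) as txn_count
from (
    select t.id, t.ts, to_date(t.ts) as ts_day,
           -- previous checkpoint on the same day; null for the first one of the day
           lag(t.ts) over(partition by t.id, to_date(t.ts) order by t.ts) as lag_ts
    from table1 t
) t1
left join table2 t2
  on  t2.id = t1.id
  and to_date(t2.txn_ts) = t1.ts_day
  and t2.txn_ts <= t1.ts
  and (t2.txn_ts > t1.lag_ts or t1.lag_ts is null)
group by t1.id, t1.ts_day, t1.ts
```

Partitioning both `lag` and the running `sum` by day is what resets the count: the first checkpoint of each day has a null `lag_ts`, so it picks up every transaction from that day so far and nothing from the day before.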
