我有一个由电子邮件、日期(TIMESTAMP(、id和多个其他列组成的表。
对于每个电子邮件条目,我想统计在前3天内与该电子邮件关联的唯一ID的数量。
+-------------------+-------------------------+------------+----+
| email | day | other cols | id |
+-------------------+-------------------------+------------+----+
| user1@gmail.com | 2020-06-21 16:31:00 UTC | ... | 0 |
| user1@gmail.com | 2020-06-22 14:54:00 UTC | ... | 1 |
| user1@gmail.com | 2020-06-23 08:23:00 UTC | ... | 0 |
| user1@gmail.com | 2020-06-24 13:51:00 UTC | ... | 0 |
| user1@gmail.com | 2020-06-25 09:54:00 UTC | ... | 2 |
| user1@gmail.com | 2020-06-25 12:25:00 UTC | ... | 0 |
| user1@gmail.com | 2020-06-26 15:21:00 UTC | ... | 2 |
| user2@hotmail.com | 2020-06-21 12:23:00 UTC | ... | 0 |
| user2@hotmail.com | 2020-06-21 16:54:00 UTC | ... | 0 |
| user2@hotmail.com | 2020-06-22 08:23:00 UTC | ... | 0 |
| user2@hotmail.com | 2020-06-22 12:13:00 UTC | ... | 1 |
| user2@hotmail.com | 2020-06-24 09:32:00 UTC | ... | 1 |
| user2@hotmail.com | 2020-06-25 05:45:00 UTC | ... | 1 |
| user2@hotmail.com | 2020-06-26 12:32:00 UTC | ... | 2 |
| user2@hotmail.com | 2020-06-27 19:53:00 UTC | ... | 1 |
+-------------------+-------------------------+------------+----+
附加列应该如下所示:
+-------------------+-------------------------+------------+----+-----------------------------+
| email | day | other cols | id | distinct ids in last 3 days |
+-------------------+-------------------------+------------+----+-----------------------------+
| user1@gmail.com | 2020-06-21 16:31:00 UTC | ... | 0 | 1 |
| user1@gmail.com | 2020-06-22 14:54:00 UTC | ... | 1 | 2 |
| user1@gmail.com | 2020-06-23 08:23:00 UTC | ... | 0 | 2 |
| user1@gmail.com | 2020-06-24 13:51:00 UTC | ... | 0 | 2 |
| user1@gmail.com | 2020-06-25 09:54:00 UTC | ... | 2 | 3 |<- 3, because ids 0, 1 and 2 have been seen in previous 3 days
| user1@gmail.com | 2020-06-25 12:25:00 UTC | ... | 0 | 3 |
| user1@gmail.com | 2020-06-26 15:21:00 UTC | ... | 2 | 2 |
| user2@hotmail.com | 2020-06-21 12:23:00 UTC | ... | 0 | 1 |
| user2@hotmail.com | 2020-06-21 16:54:00 UTC | ... | 0 | 1 |
| user2@hotmail.com | 2020-06-22 08:23:00 UTC | ... | 0 | 1 |
| user2@hotmail.com | 2020-06-22 12:13:00 UTC | ... | 1 | 2 |
| user2@hotmail.com | 2020-06-24 09:32:00 UTC | ... | 1 | 2 |
| user2@hotmail.com | 2020-06-25 05:45:00 UTC | ... | 1 | 2 |
| user2@hotmail.com | 2020-06-26 12:32:00 UTC | ... | 1 | 1 |
| user2@hotmail.com | 2020-06-27 19:53:00 UTC | ... | 1 | 1 |
+-------------------+-------------------------+------------+----+-----------------------------+
在过去的3天里,我尝试过使用窗口函数通过电子邮件进行分区并计算不同的ID。
COUNT(DISTINCT id) OVER (PARTITION BY email ORDER BY UNIX_DATE(PARSE_DATE('%Y-%m-%d', day))*24*3600 RANGE BETWEEN 3*24*3600 PRECEDING AND CURRENT ROW)
但是这是不允许的:
Window ORDER BY is not allowed if DISTINCT is specified
有一些关于堆栈溢出的解决方案,比如这样。然而,我不确定它是否考虑到在计算唯一ID之前需要通过电子邮件进行分区。
如果能给我任何建议,我将不胜感激。如果更容易的话,我也会对使用DATE而不是TIMESTAMP的解决方案持开放态度。
大多数(如果不是全部(数据库不支持窗口函数中的distinct
。在BigQuery中,通常使用窗口字符串或数组聚合来解决此问题:
select
t.* except(ids),
(select count(distinct id) from unnest(split(ids)) as id) cnt_distinct_id
from (
select
t.*,
string_agg(id) over(
partition by email
order by unix_date(parse_date('%y-%m-%d', day))*24*3600
range between 3 * 24 * 3600 preceding and current row
) ids
from mytable t
) t
子查询使用string_agg()
作为窗口函数,将前三天的所有id
聚合为一个字符串;然后,外部查询拆分并取消测试字符串,并计算不同的id
s。
下面是BigQuery标准SQL
#standardSQL
SELECT * EXCEPT(ids),
(SELECT COUNT(DISTINCT id) FROM t.ids AS id) distinct_ids
FROM (
SELECT *, ARRAY_AGG(id) OVER(preceding_days) ids
FROM `project.dataset.table`
WINDOW preceding_days AS (
PARTITION BY email
ORDER BY UNIX_DATE(DATE(day))
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
)
) t
你可以使用你的问题中的样本数据进行测试,如下面的示例所示
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'user1@gmail.com' email, TIMESTAMP'2020-06-21 16:31:00 UTC' day, '...' other_cols, 0 id UNION ALL
SELECT 'user1@gmail.com', '2020-06-22 14:54:00 UTC', '...', 1 UNION ALL
SELECT 'user1@gmail.com', '2020-06-23 08:23:00 UTC', '...', 0 UNION ALL
SELECT 'user1@gmail.com', '2020-06-24 13:51:00 UTC', '...', 0 UNION ALL
SELECT 'user1@gmail.com', '2020-06-25 09:54:00 UTC', '...', 2 UNION ALL
SELECT 'user1@gmail.com', '2020-06-25 12:25:00 UTC', '...', 0 UNION ALL
SELECT 'user1@gmail.com', '2020-06-26 15:21:00 UTC', '...', 2 UNION ALL
SELECT 'user2@hotmail.com', '2020-06-21 12:23:00 UTC', '...', 0 UNION ALL
SELECT 'user2@hotmail.com', '2020-06-21 16:54:00 UTC', '...', 0 UNION ALL
SELECT 'user2@hotmail.com', '2020-06-22 08:23:00 UTC', '...', 0 UNION ALL
SELECT 'user2@hotmail.com', '2020-06-22 12:13:00 UTC', '...', 1 UNION ALL
SELECT 'user2@hotmail.com', '2020-06-24 09:32:00 UTC', '...', 1 UNION ALL
SELECT 'user2@hotmail.com', '2020-06-25 05:45:00 UTC', '...', 1 UNION ALL
SELECT 'user2@hotmail.com', '2020-06-26 12:32:00 UTC', '...', 2 UNION ALL
SELECT 'user2@hotmail.com', '2020-06-27 19:53:00 UTC', '...', 1
)
SELECT * EXCEPT(ids),
(SELECT COUNT(DISTINCT id) FROM t.ids AS id) distinct_ids
FROM (
SELECT *, ARRAY_AGG(id) OVER(preceding_days) ids
FROM `project.dataset.table`
WINDOW preceding_days AS (
PARTITION BY email
ORDER BY UNIX_DATE(DATE(day))
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
)
) t