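I have a table in BigQuery that is missing records for some dates. I want to first insert rows for the missing dates, and then fill in the score of each inserted row from the previous date's record. How can this be done in BigQuery?
Current sample table view: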
Row date timeStamp score
1 2018-01-21 1516492800 0.44013312375
2 2018-01-22 1516579200 0.3821605743
3 2018-01-24 1516752000 0.3397971282666667
Desired table view:
Row date timeStamp score
1 2018-01-21 1516492800 0.44013312375
2 2018-01-22 1516579200 0.3821605743
3 2018-01-23 1516665600 0.3821605743
4 2018-01-24 1516752000 0.3397971282666667
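Basically, since the record for date 2018-01-23 is missing, we insert a record for it and take the score value from the previous date.
Below is for BigQuery Standard SQL: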
#standardSQL
-- Sample data standing in for the real table
WITH `project.dataset.table` AS (
  SELECT DATE '2018-01-21' date, 1516492800 timeStamp, 0.44013312375 score UNION ALL
  SELECT '2018-01-22', 1516579200, 0.3821605743 UNION ALL
  SELECT '2018-01-24', 1516752000, 0.3397971282666667
)
SELECT date,
  UNIX_SECONDS(TIMESTAMP(date)) AS timeStamp,
  -- Carry the most recent non-NULL score forward into the gap rows
  LAST_VALUE(score IGNORE NULLS) OVER(ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS score
FROM (
  -- Date range covered by the table
  SELECT MIN(date) min_date, MAX(date) max_date
  FROM `project.dataset.table`
), UNNEST(GENERATE_DATE_ARRAY(min_date, max_date)) date
LEFT JOIN `project.dataset.table`
USING(date)
ORDER BY date
Result:
Row date timeStamp score
1 2018-01-21 1516492800 0.44013312375
2 2018-01-22 1516579200 0.3821605743
3 2018-01-23 1516665600 0.3821605743
4 2018-01-24 1516752000 0.3397971282666667
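The query above only selects the filled-in rows. If the missing rows should actually be written back into the table, a sketch along the following lines might work. It assumes the table has exactly the three columns shown (date, timeStamp, score), that the sample-data WITH clause is dropped so the name refers to the real table, and it introduces a helper column is_missing that is not part of the original query:

```sql
#standardSQL
-- Sketch only: inserts rows for dates absent from the table,
-- reusing the forward-fill logic from the SELECT above.
INSERT INTO `project.dataset.table` (date, timeStamp, score)
SELECT date, timeStamp, score
FROM (
  SELECT d AS date,
    UNIX_SECONDS(TIMESTAMP(d)) AS timeStamp,
    LAST_VALUE(t.score IGNORE NULLS) OVER(ORDER BY d ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS score,
    -- is_missing (hypothetical helper): TRUE when the date had no row in the table
    t.date IS NULL AS is_missing
  FROM (
    SELECT MIN(date) min_date, MAX(date) max_date
    FROM `project.dataset.table`
  ), UNNEST(GENERATE_DATE_ARRAY(min_date, max_date)) d
  LEFT JOIN `project.dataset.table` t
  ON t.date = d
)
WHERE is_missing
```

The window function is evaluated over all dates before the outer filter runs, so each inserted row still picks up the score of the closest earlier date that has one.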
I would just generate the date array for the values needed on each row:
-- Sample data standing in for the real table
WITH t AS (
  SELECT DATE '2018-01-21' date, 1516492800 timeStamp, 0.44013312375 score UNION ALL
  SELECT '2018-01-22', 1516579200, 0.3821605743 UNION ALL
  SELECT '2018-01-24', 1516752000, 0.3397971282666667
)
SELECT dte,
  UNIX_SECONDS(TIMESTAMP(dte)) AS timeStamp,
  t.score
FROM (
  -- to_date is the day before the next existing record (NULL for the last row)
  SELECT t.*, DATE_ADD(LEAD(date) OVER (ORDER BY date), INTERVAL -1 DAY) AS to_date
  FROM t
) t CROSS JOIN
  -- Expand each row into one row per day it has to cover
  UNNEST(GENERATE_DATE_ARRAY(date, COALESCE(to_date, date))) dte
ORDER BY dte;
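I'm pretty sure that this has better performance characteristics on larger amounts of data. In fact, I suggested such an approach to a colleague just yesterday to try to speed up one of her queries.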