窗口平均值,说明差距



我需要计算前4周的平均值。。。

SELECT
*,
AVG(val) OVER (PARTITION BY some_identifier, day_of_week_column
ORDER BY date_column
ROW BETWEEN 4 PRECEDING AND 1 PRECEDING
)
AS preceding_4_week_average
FROM
myTable

然而,数据是"稀疏的">

在这种情况下,我的窗口函数应该返回"4周",而不是"4行">
-缺少的日期不是0,它隐含地是NULL

thing | date       | dow | val | avg
1   | 2018-01-01 |  1  |  1  | NULL  <= AVG({})
1   | 2018-01-08 |  1  |  2  | 1     <= AVG({1})
1   | 2018-01-15 |  1  |  3  | 1.5   <= AVG({1,2})
1   | 2018-01-22 |  1  |  4  | 2     <= AVG({1,2,3})
1   | 2018-01-29 |  1  |  5  | 2.5   <= AVG({1,2,3,4})
1   | 2018-02-12 |  1  |  7  | 4     <= AVG({3,4,5})
1   | 2018-02-19 |  1  |  8  | 5.33  <= AVG({4,5,7})
1   | 2018-02-26 |  1  |  9  | 6.66  <= AVG({5,7,8})
1   | 2018-03-05 |  1  |  10 | 8     <= AVG({7,8,9})
1   | 2018-03-12 |  1  |  11 | 11.25 <= AVG({7,8,9,10})
1   | 2018-03-19 |  1  |  12 | 9.5   <= AVG({8,9,10,11})

注:2018-02-05 没有值

我通常会用两种方式中的一种。。。


  1. LEFT JOIN到一个模板,以"强制"所有日期都存在,并有效地依赖AVG()"忽略"NULL

这并不理想,因为"事物"的数量巨大,并且构建此模板的成本高昂。

SELECT
*,
AVG(mytable.val) OVER (PARTITION BY things.id, dates.dow
ORDER BY dates.date
ROW BETWEEN 4 PRECEDING AND 1 PRECEDING
)
AS preceding_4_week_average
FROM
things
CROSS JOIN
dates
LEFT JOIN
myTable
ON  myTable.date = dates.date
AND myTable.id   = things.id


  1. 不要使用窗口函数,而是使用自联接

这并不理想,因为myTable中有数百列,而BigQuery在这方面的表现不太好。

SELECT
myTable.*,
AVG(hist.val)   AS preceding_4_week_average
FROM
myTable
LEFT JOIN
myTable   AS hist
ON  hist.id    = myTable.id
AND hist.date >= myTable.date - INTERVAL 28 DAYS
AND hist.date <  myTable.date
GROUP BY
myTable.column1,
myTable.column2,
etc, etc


实际问题

其他人有没有其他选择,最好使用窗口/分析函数"回顾4周"而不是"回顾4行"?

下面是BigQuery标准SQL

正如你将看到的,诀窍是使用RANGE而不是ROW

#standardSQL
SELECT *,
AVG(val) OVER(
PARTITION BY id, dow 
ORDER BY DATE_DIFF(DATE_TRUNC(date, WEEK), DATE_TRUNC(CURRENT_DATE(), WEEK), WEEK) 
RANGE BETWEEN 4 PRECEDING AND 1 PRECEDING
) AVG
FROM `project.dataset.table`   

你可以测试,玩上面的使用你的问题的伪数据如下

#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, DATE '2018-01-01' date, 1 dow, 1 val UNION ALL
SELECT 1, '2018-01-08', 1, 2  UNION ALL
SELECT 1, '2018-01-15', 1, 3  UNION ALL
SELECT 1, '2018-01-22', 1, 4  UNION ALL
SELECT 1, '2018-01-29', 1, 5  UNION ALL
SELECT 1, '2018-02-12', 1, 7  UNION ALL
SELECT 1, '2018-02-19', 1, 8  UNION ALL
SELECT 1, '2018-02-26', 1, 9  UNION ALL
SELECT 1, '2018-03-05', 1, 10 UNION ALL
SELECT 1, '2018-03-12', 1, 11 UNION ALL
SELECT 1, '2018-03-19', 1, 12  
)
SELECT *,
AVG(val) OVER(
PARTITION BY id, dow 
ORDER BY DATE_DIFF(DATE_TRUNC(date, WEEK), DATE_TRUNC(CURRENT_DATE(), WEEK), WEEK) 
RANGE BETWEEN 4 PRECEDING AND 1 PRECEDING
) avg
FROM `project.dataset.table`
-- ORDER BY date

结果为

Row id      date    dow val avg 
1   1   2018-01-01  1   1   null     
2   1   2018-01-08  1   2   1.0  
3   1   2018-01-15  1   3   1.5  
4   1   2018-01-22  1   4   2.0  
5   1   2018-01-29  1   5   2.5  
6   1   2018-02-12  1   7   4.0  
7   1   2018-02-19  1   8   5.333333333333333    
8   1   2018-02-26  1   9   6.666666666666667    
9   1   2018-03-05  1   10  8.0  
10  1   2018-03-12  1   11  8.5  
11  1   2018-03-19  1   12  9.5  

这是蛮力,但应该更快:

select t.*,
((case when date_1 >= date_add(date, interval -4 week)
then val_1 else 0
end) +
(case when date_2 >= date_add(date, interval -4 week)
then val_2 else 0
end) +
(case when date_3 >= date_add(date, interval -4 week)
then val_3 else 0
end) +
(case when date_4 >= date_add(date, interval -4 week)
then val_4 else 0
end)
) /
((case when date_1 >= date_add(date, interval -4 week)
then 1 else 0
end) +
(case when date_2 >= date_add(date, interval -4 week)
then 1 else 0
end) +
(case when date_3 >= date_add(date, interval -4 week)
then 1 else 0
end) +
(case when date_4 >= date_add(date, interval -4 week)
then 1 else 0
end)
)        
from (select t.*,
lag(val, 1) over (partition by id, dow order by date) as val_1,
lag(val, 2) over (partition by id, dow order by date) as val_2,
lag(val, 3) over (partition by id, dow order by date) as val_3,
lag(val, 4) over (partition by id, dow order by date) as val_4,
lag(date, 1) over (partition by id, dow order by date) as date_1,
lag(date, 2) over (partition by id, dow order by date) as date_2,
lag(date, 3) over (partition by id, dow order by date) as date_3,
lag(date, 4) over (partition by id, dow order by date) as date_4
from mytable t
) t;

可能有一种聪明的方法可以用数组来表达这一点,但我现在的位置有点早

相关内容

  • 没有找到相关文章

最新更新