BigQuery partitioned tables: what is the shortest way to specify a month?



I'd like to know the shortest way to specify a month when querying a partitioned table.

Truncating to the month with TIMESTAMP_TRUNC looks tempting, but it apparently can't be used as a partition filter:

SELECT SUM(views) views
FROM `fh-bigquery.wikipedia_v2.pageviews_2018`
WHERE TIMESTAMP_TRUNC(datehour, month) = '2018-04-01'
Cannot query over table 'fh-bigquery.wikipedia_v2.pageviews_2018' without a filter that can be used for partition elimination

BETWEEN with two dates also looks tempting, but you need to know how many days each month has:

SELECT SUM(views) views
FROM `fh-bigquery.wikipedia_v2.pageviews_2018`
WHERE DATE(datehour) BETWEEN '2018-04-01' AND '2018-04-31'
Could not cast literal "2018-04-31" to type DATE at [3:47]

Combining DATE_ADD (one month) with DATE_SUB (one day) works, but the date has to be typed twice and the query gets long:

SELECT SUM(views) views
FROM `fh-bigquery.wikipedia_v2.pageviews_2018`
WHERE DATE(datehour) 
BETWEEN '2018-04-01' 
AND DATE_SUB(DATE_ADD('2018-04-01', INTERVAL 1 MONTH), INTERVAL 1 DAY) 
15746003449

How would you improve this?

I would do it like this:

SELECT SUM(views) as views
FROM `fh-bigquery.wikipedia_v2.pageviews_2018`
WHERE datehour >= TIMESTAMP '2018-04-01' AND datehour < TIMESTAMP '2018-05-01';

You can put the date constant in a CTE:

with params as (
select date '2018-04-01' as dte
)
select sum(views) as views
from params cross join
`fh-bigquery.wikipedia_v2.pageviews_2018`
where datehour >= timestamp(params.dte) AND datehour < timestamp(date_add(params.dte, interval 1 month))
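
The half-open range above (>= the first day of the month, < the first day of the next month) avoids having to know how long the month is, and it also covers every hour of the last day. As a small sketch, the two bounds can be computed on their own from a single date; month_start, range_start and range_end are names chosen here for illustration:

```sql
-- Compute the half-open bounds [range_start, range_end) for one month.
-- month_start is a hypothetical alias; only the month is typed, once.
SELECT
  TIMESTAMP(month_start) AS range_start,
  TIMESTAMP(DATE_ADD(month_start, INTERVAL 1 MONTH)) AS range_end
FROM (SELECT DATE '2018-04-01' AS month_start);
```

For April 2018 this gives midnight UTC on 2018-04-01 and on 2018-05-01, so the filter never mentions the 30th or 31st.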

Update: after further experimentation, this is my best solution:

SELECT SUM(views) views
FROM `fh-bigquery.wikipedia_v2.pageviews_2018`
WHERE DATE_TRUNC(DATE(datehour), month) = '2018-04-01'

This is essentially the first attempt from the question, plus converting the timestamp to a DATE and then applying DATE_TRUNC.
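
If the month needs to change between runs, the same filter can be sketched with the month held in a script variable so it is typed once. This assumes BigQuery scripting (DECLARE) is available where the query runs, and month_start is a hypothetical name. I have not confirmed that partition elimination behaves the same with a variable as with a literal, so it is worth checking the bytes processed before relying on it:

```sql
-- Assumption: the query runs as a BigQuery script, so DECLARE is allowed.
-- month_start is a hypothetical variable holding the first day of the month.
DECLARE month_start DATE DEFAULT DATE '2018-04-01';

SELECT SUM(views) AS views
FROM `fh-bigquery.wikipedia_v2.pageviews_2018`
WHERE DATE_TRUNC(DATE(datehour), MONTH) = month_start;
```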

I'm leaving the other options I tried below, since they may be useful in other cases.


One option is to define a variable using WITH, so the month is typed only once:

WITH month AS (SELECT DATE('2018-04-01') m), 
full_month AS (SELECT m AS s, DATE_SUB(DATE_ADD(m, INTERVAL 1 MONTH), INTERVAL 1 DAY) AS e FROM month)
SELECT SUM(views) views
FROM `fh-bigquery.wikipedia_v2.pageviews_2018`
WHERE DATE(datehour) 
BETWEEN (SELECT s FROM full_month) AND (SELECT e FROM full_month)

Similarly, you can define SQL UDFs:

CREATE TEMPORARY FUNCTION month() AS (DATE('2018-04-01'));
CREATE TEMPORARY FUNCTION month_end() AS (DATE_SUB(DATE_ADD(month(), INTERVAL 1 MONTH), INTERVAL 1 DAY));
SELECT SUM(views) views
FROM `fh-bigquery.wikipedia_v2.pageviews_2018`
WHERE DATE(datehour) BETWEEN month() AND month_end() 

With both of these options, BigQuery is able to recognize the filter and scan only the partitions it needs.
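
The DATE_SUB(DATE_ADD(...)) expression used for the end of the month can probably be shortened with LAST_DAY. The sketch below assumes LAST_DAY is available in your BigQuery environment; the date is still typed twice, and whether this form is accepted as a partition filter is something I have not verified:

```sql
-- Assumption: LAST_DAY is supported; it returns the last date of the month
-- containing its argument (2018-04-30 for the date below).
SELECT SUM(views) AS views
FROM `fh-bigquery.wikipedia_v2.pageviews_2018`
WHERE DATE(datehour)
  BETWEEN DATE '2018-04-01' AND LAST_DAY(DATE '2018-04-01', MONTH);
```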
