BigQuery在一次过多的UNIONALL操作后抱怨——为什么会发生这种情况,我的选择是什么



我正在开发一个非常复杂的查询,它需要我多次堆叠数据(即UNION ALL(。令我惊讶的是,BigQuery不喜欢堆叠,并且试运行显示异常:

查询执行期间超出资源:没有足够的资源用于查询计划-子查询太多或查询太复杂。

我已经隔离了查询中出现问题的点,以确认问题似乎是由过多的UNION ALL引起的。我很惊讶UNION ALL会这么做,但我怀疑我在这里的想法很天真。

  1. 为什么BigQuery不能处理这个额外的UNION ALL?堆叠数据不是一种更直接的操作吗?

  2. 我有什么选择可以达到同样的结果?有没有我不知道的手术可以做同样的工作或替代方法?

这是完整的查询,尽管我应该注意到project.dataset.source_view确实首先进行了一些相对直接的处理:

WITH p0_funnel AS (
SELECT  
date, 
platform_type, 
platform, 
flow,
step_1, 
step_2, 
step_3, 
step_4, 
step_5, 
step_6
FROM `project.dataset.source_view`
), p1_funnel AS (
SELECT
date,
flow,
platform_type,
platform,
SUM(step_1) AS step_1, 
SUM(step_2) AS step_2, 
SUM(step_3) AS step_3, 
SUM(step_4) AS step_4, 
SUM(step_5) AS step_5, 
SUM(step_6) AS step_6
FROM p0_funnel
GROUP BY 
date, 
flow,
platform_type,
platform
), p2_funnel AS (
SELECT
date,
flow,
platform,
platform_type,
step_1,
step_2,
step_3,
step_4,
step_5,
step_6
FROM p1_funnel
), p3_funnel AS (
SELECT
date, platform, platform_type, flow,
'step_1' AS step,
step_1 AS step_sessions
FROM p1_funnel
UNION ALL
SELECT
date, platform, platform_type, flow,
'step_2' AS step,
step_2 AS step_sessions
FROM p1_funnel
UNION ALL
SELECT
date, platform, platform_type, flow,
'step_3' AS step,
step_3 AS step_sessions
FROM p1_funnel
UNION ALL
SELECT
date, platform, platform_type, flow,
'step_4' AS step,
step_4 AS step_sessions
FROM p1_funnel
UNION ALL
SELECT
date, platform, platform_type, flow,
'step_5' AS step,
step_5 AS step_sessions
FROM p1_funnel
UNION ALL
SELECT
date, platform, platform_type, flow,
'step_6' AS step,
step_6 AS step_sessions
FROM p1_funnel
), p4_funnel AS (
SELECT
main.date,
platform, platform_type, flow,
step,
step_1,
step_2,
step_3,
step_4,
step_5,
step_6,
step_sessions
FROM p3_funnel AS main
JOIN p2_funnel USING(date, platform, platform_type, flow)
), funnel_platform_type AS (
SELECT
date,
'platform_type' AS dimension,
platform_type AS value,
step,
step_1,
step_2,
step_3,
step_4,
step_5,
step_6,
step_sessions
FROM p4_funnel
), funnel_platform AS (
SELECT
date,
'platform' AS dimension,
platform AS value,
step,
step_1,
step_2,
step_3,
step_4,
step_5,
step_6,
step_sessions
FROM p4_funnel
), funnel_flow AS (
SELECT
date,
'flow' AS dimension,
flow AS value,
step,
step_1,
step_2,
step_3,
step_4,
step_5,
step_6,
step_sessions
FROM p4_funnel
), p5_funnel AS (
SELECT * FROM funnel_platform_type UNION ALL
SELECT * FROM funnel_platform UNION ALL
SELECT * FROM funnel_flow # including this UNION ALL first introduces the problem
)
SELECT
date,
dimension,
ROW_NUMBER() OVER (PARTITION BY dimension, step ORDER BY step_1 DESC) AS dim_order,
value,
step,
CASE
WHEN step = 'step_1' THEN 1
WHEN step = 'step_2' THEN 2
WHEN step = 'step_3' THEN 3
WHEN step = 'step_4' THEN 4
WHEN step = 'step_5' THEN 5
WHEN step = 'step_6' THEN 6
ELSE null
END AS step_order,  
CASE
WHEN step = 'step_1' THEN step_2
WHEN step = 'step_2' THEN step_3
WHEN step = 'step_3' THEN step_4
WHEN step = 'step_4' THEN step_5
WHEN step = 'step_5' THEN step_6
WHEN step = 'step_6' THEN null
ELSE null
END AS next_step_sessions,
step_1,
step_2,
step_3,
step_4,
step_5,
step_6,
step_sessions
FROM p5_funnel

建议使用临时表,而不是大量的WITH子句等。将查询分解为几个更简单的查询,并将中间结果持久化为短期表或临时表应该有助于解决此错误。

WITH子句包含一个或多个命名子查询,每次后续SELECT语句引用这些查询时都会执行这些查询。任何子句或子查询都可以引用您在WITH子句中定义的子查询。这包括集合运算符两侧的任何SELECT语句,例如UNION。

WITH子句主要用于可读性,因为BigQuery不会在WITH子句中具体化查询的结果。如果查询出现在多个WITH子句中,它将在每个子句中执行。

发生这种情况是因为BigQuery子查询配额。不是因为";UNION All";。当我尝试执行一个包含超过125个子查询的查询时,我也遇到了同样的问题。因此,尝试按子查询数对查询进行分区,并将每个部分插入临时表中,然后从临时表中收集数据,并在完成时将其删除。

我想我已经找到了(2(的答案。

我仍然不清楚(1(,我仍然感兴趣。我只能推测这是由于BigQuery的内部工作。

我对(1(的解决方案使用了不同的查询策略。从某种意义上说,它通过首先使用CROSS JOIN收集所需尺寸的网格,然后简单地连接到数据集并修剪不需要的部分,将举重与体操区分开来。

以下是它的样子:

WITH p0_funnel AS (
SELECT  
date, 
platform_type, 
platform, 
flow, 
step_1, 
step_2, 
step_3, 
step_4, 
step_5, 
step_6s AS step_6
FROM `project.dataset.source`
), p1_funnel AS (
SELECT
flow,
platform_type,
platform,
SUM(step_1) AS step_1, 
SUM(step_2) AS step_2, 
SUM(step_3) AS step_3, 
SUM(step_4) AS step_4, 
SUM(step_5) AS step_5, 
SUM(step_6) AS step_6
FROM p0_funnel
GROUP BY 
flow,
platform_type,
platform
), dimension_values AS (
SELECT DISTINCT
flow,
platform_type,
platform  
FROM p1_funnel
), dimension_names AS (
SELECT 'platform_type' AS dimension UNION ALL
SELECT 'platform' UNION ALL
SELECT 'flow'
), steps AS (
SELECT 'step_1' AS step UNION ALL
SELECT 'step_2' UNION ALL
SELECT 'step_3' UNION ALL
SELECT 'step_4' UNION ALL
SELECT 'step_5' UNION ALL
SELECT 'step_6'
), full_grid AS (
SELECT
dimension,
step,
flow,
platform_type,
platform  
FROM dimension_values
CROSS JOIN dimension_names
CROSS JOIN steps
)
SELECT
dimension,
ROW_NUMBER() OVER (PARTITION BY dimension, step ORDER BY step_1 DESC) AS dim_order,
CASE
WHEN dimension = 'platform_type' THEN platform_type
WHEN dimension = 'platform' THEN platform
WHEN dimension = 'flow' THEN flow
ELSE NULL END AS dim_value,
step,  
CASE
WHEN step = 'step_1' THEN 1
WHEN step = 'step_2' THEN 2
WHEN step = 'step_3' THEN 3
WHEN step = 'step_4' THEN 4
WHEN step = 'step_5' THEN 5
WHEN step = 'step_6' THEN 6
ELSE null
END AS step_order,  
CASE
WHEN step = 'step_1' THEN step_1
WHEN step = 'step_2' THEN step_2
WHEN step = 'step_3' THEN step_3
WHEN step = 'step_4' THEN step_4
WHEN step = 'step_5' THEN step_5
WHEN step = 'step_6' THEN step_6
ELSE null
END AS step_sessions,   
CASE
WHEN step = 'step_1' THEN step_2
WHEN step = 'step_2' THEN step_3
WHEN step = 'step_3' THEN step_4
WHEN step = 'step_4' THEN step_5
WHEN step = 'step_5' THEN step_6
WHEN step = 'step_6' THEN null
ELSE null
END AS next_step_sessions 
FROM full_grid
JOIN p1_funnel USING(platform_type, platform, flow)

CROSS JOIN在正确的地方使用时似乎非常有用,尽管我仍然不太确定为什么UNION ALL不能完成这项工作。显而易见,这是有技术原因的,我很想知道为什么,但至少有一个解决方案可以解决这个问题和其他类似的问题

相关内容

最新更新