BigQuery在一个函数中的50个不同表上重复相同的子查询50次



提前为文字墙道歉。这里有很多重复的代码,而且帖子并不像看上去那么长。

我之前发布了一个相关的stackoverflow问题,这很有帮助,但现在我有了一个单独的重塑用例。我有一个表太宽,结构太多,无法手动重塑它。相反,我有一种方法,即(a(从表中手动选择每个嵌套结构,(b(重塑它(并添加其他几列(,以及(c(UNION ALL'将它们重新组合在一起。部分(a(和(b(看起来像这样:

WITH 
t1_o as (SELECT season, game, team, t1.o.* FROM `mydatabase`),
t2_o as (SELECT season, game, team, t2.o.* FROM `mydatabase`),
t3_o as (SELECT season, game, team, t3.o.* FROM `mydatabase`),
t4_o as (SELECT season, game, team, t4.o.* FROM `mydatabase`),
t5_o as (SELECT season, game, team, t5.o.* FROM `mydatabase`),
t1_d as (SELECT season, game, team, t1.d.* FROM `mydatabase`),
t2_d as (SELECT season, game, team, t2.d.* FROM `mydatabase`),
t3_d as (SELECT season, game, team, t3.d.* FROM `mydatabase`),
t4_d as (SELECT season, game, team, t4.d.* FROM `mydatabase`),
t5_d as (SELECT season, game, team, t5.d.* FROM `mydatabase`),
-- Need to reshape EACH of the underlying structs
t1_o_long as (
SELECT 
season, game, team, true as is_o, 't1' as table_name, metric, SAFE_CAST(value AS FLOAT64) value
FROM (
SELECT 
season, game, team,
REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') metric, 
REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') value
FROM t1_o,
UNNEST(SPLIT(REGEXP_REPLACE(to_json_string(t1_o), r'{|}', ''))) pair
)
WHERE NOT LOWER(metric) IN ('season', 'game', 'team')
),
t2_o_long as (
SELECT 
season, game, team, true as is_o, 't2' as table_name, metric, SAFE_CAST(value AS FLOAT64) value
FROM (
SELECT 
season, game, team,
REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') metric, 
REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') value
FROM t2_o,
UNNEST(SPLIT(REGEXP_REPLACE(to_json_string(t2_o), r'{|}', ''))) pair
)
WHERE NOT LOWER(metric) IN ('season', 'game', 'team')
),
... 

我的问题是,是否可以创建一个函数,让我们称之为wideToLong,这将允许我执行以下操作:

WITH 
t1_o as (SELECT season, game, team, t1.o.* FROM `mydatabase`),
t2_o as (SELECT season, game, team, t2.o.* FROM `mydatabase`),
t3_o as (SELECT season, game, team, t3.o.* FROM `mydatabase`),
t4_o as (SELECT season, game, team, t4.o.* FROM `mydatabase`),
t5_o as (SELECT season, game, team, t5.o.* FROM `mydatabase`),
t1_d as (SELECT season, game, team, t1.d.* FROM `mydatabase`),
t2_d as (SELECT season, game, team, t2.d.* FROM `mydatabase`),
t3_d as (SELECT season, game, team, t3.d.* FROM `mydatabase`),
t4_d as (SELECT season, game, team, t4.d.* FROM `mydatabase`),
t5_d as (SELECT season, game, team, t5.d.* FROM `mydatabase`),
-- Need to reshape EACH of the underlying structs
t1_o_long as (wideToLong(table = t1_o, is_o = true, name = 't1')),
t2_o_long as (wideToLong(table = t2_o, is_o = true, name = 't2')),
t3_o_long as (wideToLong(table = t3_o, is_o = true, name = 't3')),
t4_o_long as (wideToLong(table = t4_o, is_o = true, name = 't4')),
t5_o_long as (wideToLong(table = t5_o, is_o = true, name = 't5')),
t1_d_long as (wideToLong(table = t1_d, is_o = false, name = 't1')),
t2_d_long as (wideToLong(table = t2_d, is_o = false, name = 't2')),
t3_d_long as (wideToLong(table = t3_d, is_o = false, name = 't3')),
t4_d_long as (wideToLong(table = t4_d, is_o = false, name = 't4')),
t5_d_long as (wideToLong(table = t5_d, is_o = false, name = 't5'))
... 

其中wideToLong包含整形功能:

SELECT 
season, game, team, true as is_o, 't1' as table_name, metric, SAFE_CAST(value AS FLOAT64) value
FROM (
SELECT 
season, game, team,
REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') metric, 
REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') value
FROM t1_o,
UNNEST(SPLIT(REGEXP_REPLACE(to_json_string(t1_o), r'{|}', ''))) pair
)
WHERE NOT LOWER(metric) IN ('season', 'game', 'team')
)

根据上一篇stackoverflow文章,我可以先将表保存在传统模式下以使其变平,但(a(我想为重新整形的数据添加额外的列(is_o和table_name(,(b(我想了解是否可以为这样的重复任务创建函数,(c(以遗留模式保存表可以进行一次整形,但不太适合我公司使用的日常数据管道。学习如何为这个重复的任务创建一个函数将是非常棒的。

谢谢。

是否可以创建一个函数,让我们称之为wideToLong,这将允许我执行以下操作?

UDF(用户定义函数(接受列作为输入并返回值。wideToLong函数期望接收一个表作为参数,并返回另一个表,这在BigQuery中还不支持。您可以尝试操作这些表,将它们用作结构数组,但这可能会使管理起来非常复杂。

您提到这些表非常不同,因此,我的建议是使用structs(t1.o,t2.d,…(将它们转换为稍后可以UNION和"wideToLong"一次的东西。

最新更新