用户定义函数(标准SQL)中不允许使用ARRAY_AGG



在BigQuery上开发一个用户定义函数,从混乱的数据集中提取电子邮件,我面临着一个问题,即ARRAY_AGG((不允许出现在临时用户定义函数(UDF(的主体中。

CREATE TEMP FUNCTION GET_EMAIL(emails ARRAY<STRING>, index INT64) AS (
ARRAY_AGG(
DISTINCT 
(SELECT * FROM 
UNNEST(
SPLIT(
REPLACE(
LOWER(
ARRAY_TO_STRING(emails, ",")
)," ", ""
)
)
) AS e where e like '%@%'
) IGNORE NULLS
)[SAFE_OFFSET(index)]
);
SELECT GET_EMAIL(["bob@hotmail.com,test@gmail.com", "12345", "bon@yahoo.com"],1) as email_1

我试图绕过ARRAY_AGG,方法是从带有OFFSET的UNNEST中选择,然后偏移量将作为索引。

但是,现在有一个列限制(标量子查询SELECT子句中不能有多个列(建议使用SELECT AS STRUCT。

我尝试了SELECT AS STRUCT:

CREATE TEMP FUNCTION GET_EMAIL(emails ARRAY<STRING>, index INT64) AS (

(SELECT AS STRUCT DISTINCT list.e, list.o FROM 
UNNEST(
SPLIT(
REPLACE(
LOWER(
ARRAY_TO_STRING(emails, ", ")
)," ", ""
)
)
) AS list
WITH OFFSET as list.o
WHERE list.e like '%@%' AND list.o = index)
);
SELECT GET_EMAIL(["bob@hotmail.com,test@gmail.com", "12345", "bob@yahoo.com"],1) as email_1

但它不喜欢我的DISTINCT,然后即使删除它,它也会抱怨解析e和o。

所以我在这里没有主意,我可能打了个结。有人能建议如何在UDF中完成这项工作吗?谢谢

以下版本适用于

CREATE TEMP FUNCTION GET_EMAIL(emails ARRAY<STRING>, index INT64) AS ((
SELECT ARRAY(
SELECT * 
FROM UNNEST(
SPLIT(
REPLACE(
LOWER(
ARRAY_TO_STRING(emails, ",")
)," ", ""
)
)
) AS e WHERE e LIKE '%@%'
)[SAFE_OFFSET(index)]
));
SELECT GET_EMAIL(["bob@hotmail.com,test@gmail.com", "12345", "bon@yahoo.com"], 1) AS email_1

结果

Row email_1  
1   test@gmail.com   

或者低于版本(这只是对您原始查询的轻微更正(

CREATE TEMP FUNCTION GET_EMAIL(emails ARRAY<STRING>, index INT64) AS ((
SELECT ARRAY_AGG(e)[SAFE_OFFSET(index)] 
FROM UNNEST(
SPLIT(
REPLACE(
LOWER(
ARRAY_TO_STRING(emails, ",")
)," ", ""
)
)
) AS e WHERE e LIKE '%@%'
));
SELECT GET_EMAIL(["bob@hotmail.com,test@gmail.com", "12345", "bon@yahoo.com"], 1) AS email_1     

显然具有相同的结果

最新更新