在BigQuery上开发一个用户定义函数,从混乱的数据集中提取电子邮件,我面临着一个问题,即ARRAY_AGG((不允许出现在临时用户定义函数(UDF(的主体中。
CREATE TEMP FUNCTION GET_EMAIL(emails ARRAY<STRING>, index INT64) AS (
ARRAY_AGG(
DISTINCT
(SELECT * FROM
UNNEST(
SPLIT(
REPLACE(
LOWER(
ARRAY_TO_STRING(emails, ",")
)," ", ""
)
)
) AS e where e like '%@%'
) IGNORE NULLS
)[SAFE_OFFSET(index)]
);
SELECT GET_EMAIL(["bob@hotmail.com,test@gmail.com", "12345", "bon@yahoo.com"],1) as email_1
我试图绕过ARRAY_AGG,方法是从带有OFFSET的UNNEST中选择,然后偏移量将作为索引。
但是,现在有一个列限制(标量子查询SELECT子句中不能有多个列(建议使用SELECT AS STRUCT。
我尝试了SELECT AS STRUCT:
CREATE TEMP FUNCTION GET_EMAIL(emails ARRAY<STRING>, index INT64) AS (
(SELECT AS STRUCT DISTINCT list.e, list.o FROM
UNNEST(
SPLIT(
REPLACE(
LOWER(
ARRAY_TO_STRING(emails, ", ")
)," ", ""
)
)
) AS list
WITH OFFSET as list.o
WHERE list.e like '%@%' AND list.o = index)
);
SELECT GET_EMAIL(["bob@hotmail.com,test@gmail.com", "12345", "bob@yahoo.com"],1) as email_1
但它不喜欢我的DISTINCT,然后即使删除它,它也会抱怨解析e和o。
所以我在这里没有主意,我可能打了个结。有人能建议如何在UDF中完成这项工作吗?谢谢
以下版本适用于
CREATE TEMP FUNCTION GET_EMAIL(emails ARRAY<STRING>, index INT64) AS ((
SELECT ARRAY(
SELECT *
FROM UNNEST(
SPLIT(
REPLACE(
LOWER(
ARRAY_TO_STRING(emails, ",")
)," ", ""
)
)
) AS e WHERE e LIKE '%@%'
)[SAFE_OFFSET(index)]
));
SELECT GET_EMAIL(["bob@hotmail.com,test@gmail.com", "12345", "bon@yahoo.com"], 1) AS email_1
结果
Row email_1
1 test@gmail.com
或者低于版本(这只是对您原始查询的轻微更正(
CREATE TEMP FUNCTION GET_EMAIL(emails ARRAY<STRING>, index INT64) AS ((
SELECT ARRAY_AGG(e)[SAFE_OFFSET(index)]
FROM UNNEST(
SPLIT(
REPLACE(
LOWER(
ARRAY_TO_STRING(emails, ",")
)," ", ""
)
)
) AS e WHERE e LIKE '%@%'
));
SELECT GET_EMAIL(["bob@hotmail.com,test@gmail.com", "12345", "bon@yahoo.com"], 1) AS email_1
显然具有相同的结果