只有当列在Bigquery中具有非null值时,才将多个列聚合到一个数组中



我有一个如下所示的表:

+----+------+------+------+------+------+
| id | col1 | col2 | col3 | col4 | col5 |
+----+------+------+------+------+------+
| a  |    1 | null | null | null | null |
| b  |    1 | 2    | 3    | 4    | null |
| c  |    1 | 2    | 3    | 4    | 5    |
| d  |    2 | 1    | 7    | null | 4    |
+----+------+------+------+------+------+

我想要创建一个聚合表,其中对于每个id,我想要一个包含所有其他列的非null值的数组。输出应该是这样的:

+-----+-------------+
| id  |   agg_col   |
+-----+-------------+
| a   | [1]         |
| b   | [1,2,3,4]   |
| c   | [1,2,3,4,5] |
| d   | [2,1,7,4]   |
+-----+-------------+

是否可以使用bigquery标准sql生成输出?

以下不是超通用的解决方案,但适用于您提供的特定示例-id以字母数字表示(不是以数字开头(,其余列为数字-整数

#standardSQL
SELECT id,
ARRAY(SELECT * FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r':(d*)')) col WHERE col != '') AS agg_col_as_array,
CONCAT('[', ARRAY_TO_STRING(ARRAY(SELECT * FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r':(d*)')) col WHERE col != ''), ','), ']') AS agg_col_as_string
FROM `project.dataset.table` t

你可以使用下面中的问题样本数据来测试、玩上面的游戏

#standardSQL
WITH `project.dataset.table` AS (
SELECT 'a' id, 1 col1, NULL col2, NULL col3, NULL col4, NULL col5 UNION ALL
SELECT 'b', 1, 2, 3, 4, NULL UNION ALL
SELECT 'c', 1, 2, 3, 4, 5 UNION ALL
SELECT 'd', 2, 1, 7, NULL, 4
)
SELECT id,
ARRAY(SELECT * FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r':(d*)')) col WHERE col != '') AS agg_col_as_array,
CONCAT('[', ARRAY_TO_STRING(ARRAY(SELECT * FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r':(d*)')) col WHERE col != ''), ','), ']') AS agg_col_as_string
FROM `project.dataset.table` t
-- ORDER BY id   

结果为

Row id  agg_col_as_array    agg_col_as_string    
1   a   1                   [1]  
2   b   1                   [1,2,3,4]    
2        
3        
4        
3   c   1                   [1,2,3,4,5]  
2        
3        
4        
5        
4   d   2                   [2,1,7,4]    
1        
7        
4        

您认为可以通过提及特定列,然后将它们绑定到数组中来实现这一点吗?

当然,这是可行的-请参阅下面的

#standardSQL
WITH `project.dataset.table` AS (
SELECT 'a' id, 1 col1, NULL col2, NULL col3, NULL col4, NULL col5 UNION ALL
SELECT 'b', 1, 2, 3, 4, NULL UNION ALL
SELECT 'c', 1, 2, 3, 4, 5 UNION ALL
SELECT 'd', 2, 1, 7, NULL, 4
)
SELECT id, 
ARRAY(
SELECT col 
FROM UNNEST([col1, col2, col3, col4, col5]) col 
WHERE NOT col IS NULL
) AS agg_col_as_array,
CONCAT('[', ARRAY_TO_STRING(
ARRAY(
SELECT CAST(col AS STRING) 
FROM UNNEST([col1, col2, col3, col4, col5]) col 
WHERE NOT col IS NULL
), ','), ']') AS agg_col_as_string
FROM `project.dataset.table` t
-- ORDER BY id       

但是。。。这不是最好的选择,因为您需要针对不同的用途管理和调整每种情况下的列的数量和名称

以下解决方案是我的原始答案的调整版本,以解决您的最新评论-Actually the sample was too simple. Both of my id and other columns have alphanumeric and special characters.

#standardSQL
WITH `project.dataset.table` AS (
SELECT 'a' id, 1 col1, NULL col2, NULL col3, NULL col4, NULL col5 UNION ALL
SELECT 'b', 1, 2, 3, 4, NULL UNION ALL
SELECT 'c', 1, 2, 3, 4, 5 UNION ALL
SELECT 'd', 2, 1, 7, NULL, 4
)
SELECT id,
ARRAY(
SELECT col 
FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r':(.*?)(?:,|})')) col WITH OFFSET 
WHERE col != 'null' AND OFFSET > 0
) AS agg_col_as_array,
CONCAT('[', ARRAY_TO_STRING(
ARRAY(
SELECT col 
FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r':(.*?)(?:,|})')) col WITH OFFSET 
WHERE col != 'null' AND OFFSET > 0
), ','), ']') AS agg_col_as_string
FROM `project.dataset.table` t
-- ORDER BY id   

两者的结果与之前相同

Row id  agg_col_as_array    agg_col_as_string    
1   a   1                   [1]  
2   b   1                   [1,2,3,4]    
2        
3        
4        
3   c   1                   [1,2,3,4,5]  
2        
3        
4        
5        
4   d   2                   [2,1,7,4]    
1        
7        
4         

最新更新