如何使用bigquery计算每列项目的频率



我有一个来自Google Bigquery表的列,它具有基于条件(c1、c2、c3…(的某些结果(na、pass、warn和error(。

data    c1      c2      c3      ...
------- ------- ------- ------- -------
data1   pass    na      warn    ...
data2   pass    na      pass    ...
data3   error   pass    error   ...

现在,我想为输入表的每一列获取一个直方图,并创建如下汇总表。

criteria   na      pass    warn    error
---------- ------- ------- ------- -------
c1         0       2       0       1
c2         2       1       0       0
c3         0       1       1       1
...        ...     ...     ...     ...

事实上,有很多条件,所以我想在不枚举条件的情况下创建汇总表。如何获取输出表?

您可以考虑以下内容,

CREATE TEMP FUNCTION json_keys(input STRING) RETURNS Array<String> 
LANGUAGE js AS """
return Object.keys(JSON.parse(input));
""";
CREATE TEMP FUNCTION json_values(input STRING) RETURNS Array<String> 
LANGUAGE js AS """
return Object.values(JSON.parse(input));
""";
SELECT * FROM (
SELECT criteria, result
FROM (SELECT * EXCEPT(data) FROM sample_table) t,
UNNEST (json_keys(TO_JSON_STRING(t))) criteria WITH OFFSET JOIN
UNNEST (json_values(TO_JSON_STRING(t))) result WITH OFFSET USING (offset)
) PIVOT (COUNT(result) FOR result IN ('na', 'pass', 'warn', 'error'));
+----------+----+------+------+-------+
| criteria | na | pass | warn | error |
+----------+----+------+------+-------+
| c1       |  0 |    2 |    0 |     1 |
| c2       |  2 |    1 |    0 |     0 |
| c3       |  0 |    1 |    1 |     1 |
+----------+----+------+------+-------+

您可以取消透视,然后再次透视以获得所需的表:

input as (
select "data1" as data, "pass" as c1,"warn" as c2,"na" as c3, "pass" as c4 union all 
select "data2" as data, "na" as c1,"pass" as c2,"pass" as c3, "error" as c4 union all 
select "data3" as data, "warn" as c1,"pass" as c2,"na" as c3, "warn" as c4 union all 
select "data4" as data, "pass" as c1,"error" as c2,"pass" as c3, "error" as c4 union all 
select "data5" as data, "pass" as c1,"na" as c2,"na" as c3, "error" as c4 union all 
select "data6" as data, "pass" as c1,"warn" as c2,"na" as c3, "na" as c4 
),
step_unpivot as (
select * from input unpivot(results for criterias in (c1,c2,c3,c4))
)
select *  from step_unpivot pivot(count(data) for results in ('pass', 'warn','na','error'))

由于您希望对unpivot((进行动态列选择,因此您可能可以对该部分使用此问题的答案:Bigquery 中的动态unpivot

编辑:下面是一个带有动态反透视的版本


DECLARE myunpivot STRING;
SET myunpivot = (
SELECT CONCAT('(', STRING_AGG( column_name, ','), ')'),
From(
SELECT column_name FROM dataset_name.INFORMATION_SCHEMA.COLUMNS
where table_name ="unpivot_pivot" 
and column_name not in("data")));
EXECUTE IMMEDIATE format("""
with
step_unpivot as (
select * from dataset_name.table_name
unpivot(results for criterias in %s)
)
select *  from step_unpivot pivot(count(data) for results in ('pass', 'warn','na','error'))
""", myunpivot);

最新更新