如何UPIVOT表中的所有列并聚合到数据质量/验证度量中?SQL雪花



我有一个表,里面有60多列,我想UNPIVOT,这样每列都变成一行,然后找到每个条目的填充率、最小值和最大值。

例如

EVENT_ID>03/14/2109/20/21
ID START_DATE END_DATE1PROVIDER_CODE
01 01/2300234010012323
02 06/04/2100259060023454
03 07/20/21

我有两个实现报告的想法。

第一种方法是将所有值强制转换为VARCHAR,然后使用UNPIVOT:

-- Generate dummy data
create or replace table t1 (c1 int, c2 int, c3 int, c4 int, c5 int, c6 int, c7 int, c8 int, c9 int, c10 int) as
select
iff(random()%2=0, random(), null), iff(random()%2=0, random(), null),
iff(random()%2=0, random(), null), iff(random()%2=0, random(), null),
iff(random()%2=0, random(), null), iff(random()%2=0, random(), null),
iff(random()%2=0, random(), null), iff(random()%2=0, random(), null),
iff(random()%2=0, random(), null), iff(random()%2=0, random(), null)
from table(generator(rowcount => 1000000000))
;
-- Query
with
cols as (
select column_name, ordinal_position
from information_schema.columns
where table_catalog = current_database()
and table_schema = current_schema()
and table_name = 'T1'
),
stringified as (
select
c1::varchar c1, c2::varchar c2, c3::varchar c3, c4::varchar c4, c5::varchar c5,
c6::varchar c6, c7::varchar c7, c8::varchar c8, c9::varchar c9, c10::varchar c10
from t1
),
data as (
select column_name, column_value
from stringified
unpivot(column_value for column_name in (c1, c2, c3, c4, c5, c6, c7, c8, c9, c10))
)
select
c.column_name,
count(d.column_value)/(select count(*) from t1) fill_rate,
min(d.column_value) min,
max(d.column_value) max
from cols c
left join data d using (column_name)
group by c.column_name, c.ordinal_position
order by c.ordinal_position
;
/*
COLUMN_NAME FILL_RATE   MIN MAX
C1  0.500000    -1000000069270747870    999999972962694409
C2  0.499980    -1000000027928146782    999999946877079818
C3  0.499996    -1000000012155323098    999999942281548701
C4  0.500017    -1000000056353213091    999999946421698482
C5  0.500015    -1000000015608859996    999999993977648967
C6  0.500003    -1000000007081089270    999999998851014730
C7  0.499987    -100000008605944993 999999968272328033
C8  0.499992    -1000000042470913027    999999977402822725
C9  0.500011    -1000000058928465662    999999969060696774
C10 0.500029    -1000000011306371004    99999996061390938
*/

这是一种简单的方法,但它仍然需要列出两次所有列名,而且在列数非常庞大的情况下,这有点困难(但我相信这比一个巨大的UNIONALL查询要好得多(。


另一个解决方案有点棘手,但如果行长度不超过VARIANT值限制(16 MiB(,则可以使用OBJECT_CONSTRUCT(*)聚合来取消透视表:

with
cols as (
select column_name, ordinal_position
from information_schema.columns
where table_catalog = current_database()
and table_schema = current_schema()
and table_name = 'T1'
),
data as (
select f.key column_name, f.value::varchar column_value
from (select object_construct(*) rec from t1) up,
lateral flatten(up.rec) f
)
select
c.column_name,
count(d.column_value)/(select count(*) from t1) fill_rate,
min(d.column_value) min,
max(d.column_value) max
from cols c
left join data d using (column_name)
group by c.column_name, c.ordinal_position
order by c.ordinal_position
;
/*
COLUMN_NAME FILL_RATE   MIN MAX
C1  0.500000    -1000000069270747870    999999972962694409
C2  0.499980    -1000000027928146782    999999946877079818
C3  0.499996    -1000000012155323098    999999942281548701
C4  0.500017    -1000000056353213091    999999946421698482
C5  0.500015    -1000000015608859996    999999993977648967
C6  0.500003    -1000000007081089270    999999998851014730
C7  0.499987    -100000008605944993 999999968272328033
C8  0.499992    -1000000042470913027    999999977402822725
C9  0.500011    -1000000058928465662    999999969060696774
C10 0.500029    -1000000011306371004    99999996061390938
*/

OBJECT_CONSTRUCT(*)聚合是OBJECT_CONSTRUCT函数的一种特殊用法,它提取列名作为每个JSON对象的键。据我所知,这是以编程方式从表中提取列名和值的唯一方法。

由于OBJECT_CONSTRUCT是一个相对繁重的操作,它通常比第一个解决方案花费更长的时间,但您不需要使用此技巧来编写所有列名。

最新更新