我已经在堆栈上探索了几个小时的其他预运行/映射/铸造解决方案的示例,但似乎找不到适用于我的数据的解决方案。
这是我的数据样本:
with test_data (id, messy_json) AS (
VALUES ('TEST_A', JSON '{"issue":[],"problem":[{"category":"math","id":2,"name":"subtraction"},{"category":"math","id":3,"name":"division"},{"category":"english","id":25,"name":"verbs"},{"category":"english","id":27,"name":"grammar"},{"category":"language","id":1,"name":"grammar"}],"version":4}'),
('TEST_B', JSON '{"problem":[],"version":4}'),
('TEST_C', JSON '{"version": 4, "problem": [], "issue": [null, null, null, null, null, null, null, null, null, null, null]}')
),
JSON列是半非结构化的,可以容纳多个lvl/并不总是像其他行一样具有每个键:值对。
我在尝试解决方案,比如:
with test_data AS (
select id,
messy_json
from larger_tbl),
select
id as id,
json_extract_scalar(test_data, '$.version') as lvl1_version
json_extract_scalar(lvl2, '$.problem') as lvl2_id
from test
LEFT JOIN UNNEST(CAST(json_parse(messy_json) AS array(json))) AS x(lvl1) ON TRUE
LEFT JOIN UNNEST(CAST(json_extract(lvl1, '$.problem') AS array(json))) AS y(lvl2) ON TRUE
这让我出现了铸造错误等。我已经尝试了的一些变体
unnest(cast(json_col as map(varchar, map(varchar,varchar)) options too.
我的目标是用保留的ID和保留在长数据集中的所有键/多lvl键来分解整个数据集。我感谢任何意见/指导,谢谢!
老实说,基于提供的样本数据和声明的目标,您当前的查询没有多大意义。
UNNEST
与cross join
一起使用,而不是与left join
一起使用(我使用简洁的语法完全跳过关键字(- 不需要在测试数据中使用
json_parse(messy_json)
,因为它已经是json了(尽管我假设在实际数据中它只是一个varchar字段,但应该保留它( - 根json是一个json对象,所以它不能被强制转换为数组(也没有必要(
json_extract_scalar(lvl2, '$.problem') as lvl2_id
中的problem
应更改为id
,因为problem
阵列已被取消测试- 若要处理问题数组为空的情况,可以使用以下方法:用一个包含1个元素的伪数组来取消对多个数组的测试
-- sample data
with test_data (id, messy_json) AS (
VALUES ('TEST_A', JSON '{"issue":[],"problem":[{"category":"math","id":2,"name":"subtraction"},{"category":"math","id":3,"name":"division"},{"category":"english","id":25,"name":"verbs"},{"category":"english","id":27,"name":"grammar"},{"category":"language","id":1,"name":"grammar"}],"version":4}'),
('TEST_B', JSON '{"problem":[],"version":4}'),
('TEST_C', JSON '{"version": 4, "problem": [], "issue": [null, null, null, null, null, null, null, null, null, null, null]}')
)
-- query
select id
, json_extract_scalar(messy_json, '$.version') as lvl1_version
, json_extract_scalar(lvl2, '$.id') as lvl2_id
from test_data
, UNNEST(CAST(json_extract(messy_json, '$.problem')
AS array(json)), array[1]) AS y(lvl2, ignored);
输出:
id | lvl1_version | lvl2_id |
---|---|---|
TEST_A | 4 | 2 |
TEST_A | 4 | 3 |
TEST_A | 4 | 25 |
TEST_A | 4 | 27 |
TEST_A | 4 | 1 |
TEST_B | 4 | NULL |
TEST_C | 4 | NULL |