Presto unnest/map complex json



我已经在堆栈上探索了几个小时的其他预运行/映射/铸造解决方案的示例,但似乎找不到适用于我的数据的解决方案。

这是我的数据样本:

with test_data (id, messy_json) AS (
VALUES ('TEST_A', JSON '{"issue":[],"problem":[{"category":"math","id":2,"name":"subtraction"},{"category":"math","id":3,"name":"division"},{"category":"english","id":25,"name":"verbs"},{"category":"english","id":27,"name":"grammar"},{"category":"language","id":1,"name":"grammar"}],"version":4}'),
('TEST_B', JSON '{"problem":[],"version":4}'),
('TEST_C', JSON '{"version": 4, "problem": [], "issue": [null, null, null, null, null, null, null, null, null, null, null]}')
),

JSON列是半非结构化的,可以容纳多个lvl/并不总是像其他行一样具有每个键:值对。

我在尝试解决方案,比如:

with test_data AS (
select id,
messy_json        
from larger_tbl),

select 
id as id,
json_extract_scalar(test_data, '$.version') as lvl1_version
json_extract_scalar(lvl2, '$.problem') as lvl2_id
from test
LEFT JOIN UNNEST(CAST(json_parse(messy_json) AS array(json))) AS x(lvl1) ON TRUE
LEFT JOIN UNNEST(CAST(json_extract(lvl1, '$.problem') AS array(json))) AS y(lvl2) ON TRUE

这让我出现了铸造错误等。我已经尝试了的一些变体

unnest(cast(json_col as map(varchar, map(varchar,varchar)) options too.

我的目标是用保留的ID和保留在长数据集中的所有键/多lvl键来分解整个数据集。我感谢任何意见/指导,谢谢!

老实说,基于提供的样本数据和声明的目标,您当前的查询没有多大意义。

  • UNNESTcross join一起使用,而不是与left join一起使用(我使用简洁的语法完全跳过关键字(
  • 不需要在测试数据中使用json_parse(messy_json),因为它已经是json了(尽管我假设在实际数据中它只是一个varchar字段,但应该保留它(
  • 根json是一个json对象,所以它不能被强制转换为数组(也没有必要(
  • json_extract_scalar(lvl2, '$.problem') as lvl2_id中的problem应更改为id,因为problem阵列已被取消测试
  • 若要处理问题数组为空的情况,可以使用以下方法:用一个包含1个元素的伪数组来取消对多个数组的测试
-- sample data
with test_data (id, messy_json) AS (
VALUES ('TEST_A', JSON '{"issue":[],"problem":[{"category":"math","id":2,"name":"subtraction"},{"category":"math","id":3,"name":"division"},{"category":"english","id":25,"name":"verbs"},{"category":"english","id":27,"name":"grammar"},{"category":"language","id":1,"name":"grammar"}],"version":4}'),
('TEST_B', JSON '{"problem":[],"version":4}'),
('TEST_C', JSON '{"version": 4, "problem": [], "issue": [null, null, null, null, null, null, null, null, null, null, null]}')
)
-- query
select id
, json_extract_scalar(messy_json, '$.version') as lvl1_version
, json_extract_scalar(lvl2, '$.id')            as lvl2_id
from test_data
, UNNEST(CAST(json_extract(messy_json, '$.problem') 
AS array(json)), array[1]) AS y(lvl2, ignored);

输出:

idlvl1_versionlvl2_id
TEST_A42
TEST_A43
TEST_A425
TEST_A427
TEST_A41
TEST_B4NULL
TEST_C4NULL

相关内容

  • 没有找到相关文章

最新更新