如何将Avro记录数组加载到Vertica array中



我的avro文件包含以下列:

{"name":"my_column","type":["null",{"type":"array","items":{"type":"record","name":"my_column","namespace":"v11","fields":[{"name":"my_column","type":["null","int"],"default":null}]}}],"default":null}

我将数据加载到Vertica中并存储为VARBINARY。例子:

db=> select MapToString(my_column) from tab limit 1;
MapToString
------------------------------------------------------------------------------------------------------------------------------------------------------
{
"0.__name__": "my_column",
"0.my_column": "5",
"1.__name__": "my_column",
"1.my_column": "9"
}
(1 row)

数据实际上可以简化为ARRAY[INT]。(即ARRAY[5,9]).

执行这个转换的正确方法是什么?通过UDTF或UDParser扩展Vertica ?通过SQL执行此转换?别的吗?

编辑:我要检查标量UDF是否可以嵌入到AVROPARSER旁边的COPY命令中,或者它是否需要额外的ETL。

谢谢!

尝试物化列-好吧,您需要知道在flex表中期望什么…

我有这个表:

SELECT                                                                                                                                                                     
__id__
, REGEXP_REPLACE(MAPTOSTRING(__raw__),'s+',' ') AS rawasstring
FROM a;
-- out  __id__ |                                           rawasstring                                            
-- out --------+--------------------------------------------------------------------------------------------------
-- out         | { "0.__name__": "my_column", "0.my_column": "5", "1.__name__": "my_column", "1.my_column": "9" }
-- out (1 row)

然后,我只是添加了一个物化列,像这样:

ALTER TABLE a ADD int_array ARRAY[int,10] 
DEFAULT ARRAY[
MAPLOOKUP(__raw__, '0.my_column')
, MAPLOOKUP(__raw__, '1.my_column')
, MAPLOOKUP(__raw__, '2.my_column')
, MAPLOOKUP(__raw__, '3.my_column')
, MAPLOOKUP(__raw__, '4.my_column')
, MAPLOOKUP(__raw__, '5.my_column')
, MAPLOOKUP(__raw__, '6.my_column')
, MAPLOOKUP(__raw__, '7.my_column')
, MAPLOOKUP(__raw__, '8.my_column')
, MAPLOOKUP(__raw__, '9.my_column')
]::ARRAY[INT,10];                                                                                                                                                           

Map中不存在的键为NULL,并且没有真正添加到数组中。现在我有了数组:

SELECT                                                                                                                                                                     
__id__::VARCHAR
, REGEXP_REPLACE(MAPTOSTRING(__raw__),'s+',' ') AS rawasstring
, int_array
FROM a ;
-- out  __id__ |                                           rawasstring                                            | int_array 
-- out --------+--------------------------------------------------------------------------------------------------+-----------
-- out         | { "0.__name__": "my_column", "0.my_column": "5", "1.__name__": "my_column", "1.my_column": "9" } | [5,9]
-- out (1 row)

最新更新