I need to select elements from deeply nested data structures in a Parquet file. The schema of the Parquet file is as follows:
root
|-- descriptor_type: string (nullable = true)
|-- src_date: long (nullable = true)
|-- downloaded: long (nullable = true)
|-- exit_nodes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- fingerprint: string (nullable = true)
| | |-- published: long (nullable = true)
| | |-- last_status: long (nullable = true)
| | |-- exit_adresses: map (nullable = true)
| | | |-- key: string
| | | |-- value: long (valueContainsNull = true)
One entry in the dataset, serialized as JSON, looks like this:
{
"descriptor_type": "tordnsel 1.0",
"src_date": 1472781720000,
"downloaded": 1472781720000,
"exit_nodes": [
{
"fingerprint": "CECCFA65F3EB16CA8C0F9EAC9050C348515E26C5",
"published": 1472713568000,
"last_status": 1472716961000,
"exit_adresses": {
"178.217.187.39": 1472717419000
}
},
...
I am using Spark 2.0 as integrated in SnappyData 0.6, where the Parquet file is registered like this:
snappy> CREATE EXTERNAL TABLE stage USING PARQUET OPTIONS (path './testdata.parquet.snappy');
Selecting the first row gives the following result:
snappy> select * from stage limit 1;
descriptor_type|src_date |downloaded |exit_nodes
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
tordnsel 1.0 |1472781720000 |1472781720000 |5704000060110000e011000060120000d812000058130000d813000058140000d014000050150000d015000050160000d016000048170000c81700004018000&
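The field 'exit_nodes' contains only one long string, rather than the array of structs I naively hoped for.
I can select a specific element of the 'exit_nodes' array by index: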
snappy> select exit_nodes[0].fingerprint, exit_nodes[0].published, exit_nodes[0].exit_adresses from stage limit 1;
EXIT_NODES[0].FINGERPRINT |EXIT_NODES[0].PUBLISHED|EXIT_NODES[0].EXIT_ADRESSES
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
3D28E5FBD0C670C004E59D6CFDE7305BC8948FA8 |1472750744000 |15000000010000000800000037382e3134322e31392e3231330100000008000000b057f0e656010000
With the 'exit_adresses' map, I have had no luck:
snappy> select exit_nodes[0].exit_adresses.key from stage limit 1;
EXIT_NODES[0].EXIT_ADRESSES[KEY]
--------------------------------
NULL
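So the questions are:
- How do I select the keys and values in one 'exit_adresses' map?
- How do I select all records from the array elements, or from all key-value pairs in the nested map, so that I can import them from the Parquet file into an RDBMS?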
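I don't have a direct answer, but IMHO there is no support for querying nested Parquet types beyond what is shown in the test suite linked below.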
It pretty much covers everything you can do: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
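For the second question, it may help to pin down the flat row shape you want in the RDBMS: one row per (exit node, exit address) pair. The following is only a minimal sketch in plain Python, not something SnappyData provides. It assumes the entry is available as JSON exactly as shown in the question, uses an in-memory SQLite database as a stand-in for the target RDBMS, and the table name `exit_address` and the column names `address` and `address_ts` are hypothetical.

```python
import json
import sqlite3

# One entry from the question, trimmed to a single exit node.
ENTRY = json.loads("""
{
  "descriptor_type": "tordnsel 1.0",
  "src_date": 1472781720000,
  "downloaded": 1472781720000,
  "exit_nodes": [
    {
      "fingerprint": "CECCFA65F3EB16CA8C0F9EAC9050C348515E26C5",
      "published": 1472713568000,
      "last_status": 1472716961000,
      "exit_adresses": {"178.217.187.39": 1472717419000}
    }
  ]
}
""")


def flatten(entry):
    """Yield one flat tuple per (exit node, exit address) pair."""
    # The schema marks the array, its elements and the map as nullable,
    # so each level is guarded against None.
    for node in entry.get("exit_nodes") or []:
        if node is None:
            continue
        for address, address_ts in (node.get("exit_adresses") or {}).items():
            yield (entry["src_date"], node["fingerprint"], node["published"],
                   node["last_status"], address, address_ts)


def load(rows):
    """Insert the flat rows into an in-memory SQLite table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE exit_address ("
        "src_date INTEGER, fingerprint TEXT, published INTEGER, "
        "last_status INTEGER, address TEXT, address_ts INTEGER)"
    )
    conn.executemany("INSERT INTO exit_address VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


ROWS = list(flatten(ENTRY))
CONN = load(ROWS)
```

In Spark SQL itself this kind of flattening is normally expressed with `explode` over the array and then over the map; whether the SQL shell of SnappyData 0.6 accepts that is an assumption you would have to verify. Note also that the column header `EXIT_NODES[0].EXIT_ADRESSES[KEY]` in your output suggests `.key` was read as a lookup of a map entry literally named "key", which would explain the NULL.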