I have data, some of which contains nested columns (arrays of arrays of objects), saved as Parquet with Spark 2.2.
Now I am trying to access this data externally with Presto, and whenever I query any of the nested columns I get the following exception:
com.facebook.presto.spi.PrestoException: Error opening Hive split hdfs://name-node/parquet_path/part-00023-8d4f14b1-a3f1-4055-b931-04838701048d-c000.snappy.parquet (offset=0, length=108289): parquet.io.PrimitiveColumnIO cannot be cast to parquet.io.GroupColumnIO
at com.facebook.presto.hive.parquet.ParquetPageSourceFactory.createParquetPageSource(ParquetPageSourceFactory.java:220)
at com.facebook.presto.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:115)
at com.facebook.presto.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:157)
at com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:93)
at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:44)
at com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:56)
at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:239)
at com.facebook.presto.operator.Driver.processInternal(Driver.java:373)
at com.facebook.presto.operator.Driver.lambda$processFor$8(Driver.java:282)
at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:672)
at com.facebook.presto.operator.Driver.processFor(Driver.java:276)
at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:973)
at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:162)
at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:477)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: parquet.io.PrimitiveColumnIO cannot be cast to parquet.io.GroupColumnIO
at parquet.io.ColumnIOConverter.constructField(ColumnIOConverter.java:56)
at parquet.io.ColumnIOConverter.constructField(ColumnIOConverter.java:90)
at com.facebook.presto.hive.parquet.ParquetPageSource.<init>(ParquetPageSource.java:109)
Interestingly, I can query the other, non-nested columns without any problem.
The table was created as follows:
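For illustration, a pair of hypothetical queries against this table (column names taken from the DDL below): the first, which touches only a flat column, succeeds, while the second, which dereferences a field inside the nested array-of-rows column, triggers the ClassCastException above.

```sql
-- Works: plain top-level column
SELECT not_nested_field_1
FROM hive.tests.table_name
LIMIT 10;

-- Fails with "PrimitiveColumnIO cannot be cast to GroupColumnIO":
-- subscripting the ARRAY(ROW(...)) column and reading a nested field
SELECT not_nested_field_6[1].nested_level0_field1
FROM hive.tests.table_name
LIMIT 10;
```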
CREATE TABLE hive.tests.table_name (
    not_nested_field_1 BIGINT,
    not_nested_field_2 BIGINT,
    not_nested_field_3 BOOLEAN,
    not_nested_field_4 DOUBLE,
    not_nested_field_5 ARRAY(VARCHAR),
    not_nested_field_6 ARRAY(ROW(
        nested_level0_field1 BOOLEAN,
        nested_level0_field2 BIGINT,
        nested_level0_field3 BIGINT,
        nested_level0_field4 ARRAY(ROW(
            nested_level1_field1 BOOLEAN,
            nested_level1_field2 BIGINT,
            nested_level1_field3 VARCHAR,
            nested_level1_field4 ARRAY(ROW(
                nested_level2_field1 VARCHAR,
                nested_level2_field2 VARCHAR,
                nested_level2_field3 ARRAY(ROW(
                    nested_level3_field1 VARCHAR,
                    nested_level3_field2 VARCHAR)))),
            nested_level1_field5 ARRAY(ROW(
                nested_level2_field4 BIGINT,
                nested_level2_field5 BIGINT,
                nested_level2_field6 ARRAY(ROW(
                    nested_level3_field3 VARCHAR,
                    nested_level3_field4 VARCHAR)))))))))
WITH (
    format = 'PARQUET',
    external_location = 'hdfs://name-node/parquet_path/'
);
I am using Presto version 0.208, and the external table was created using a local Hive metastore.
Any help would be appreciated :)
This issue was solved by setting the hive.parquet.use-column-names=true property in catalog/hive.properties.
By default, Presto uses column indexes to access Parquet data, so this property has to be set explicitly to make it look up the Parquet columns by the names defined in the CREATE TABLE statement.
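Concretely, the Hive catalog file would look something like this (the connector name and metastore URI are placeholders for your own setup; only the last property is the actual fix). Presto reads catalog properties at startup, so the servers need a restart after the change.

```properties
# etc/catalog/hive.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083

# Match Parquet columns to table columns by name
# instead of by position/index
hive.parquet.use-column-names=true
```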