如何使用Hive 1.2从S3中的Parquet文件创建外部表

我已经在qubole(hive(中创建了一个外部表，该表从s3读取parquet(压缩：snappy(文件，但是在执行 SELECT * table_name 时，我将获得所有列的null值除了分区列。

我尝试在serdeproperties中使用不同的serialization.format值，但我仍面临同一问题。并且在删除属性'serialization.format' = '1'时，我正在获得ERROR: Failed with exception java.io.IOException:Can not read value at 0 in block -1 in file s3://path_to_parquet/。

我检查了镶木quet文件，并能够使用parquet-tool读取数据：

**file_01.snappy.parquet:**
{"col_2":1234,"col_3":ABC}
{"col_2":124,"col_3":FHK}
{"col_2":12515,"col_3":UPO}

**External table stmt:**
CREATE EXTERNAL TABLE parquet_test
(
    col2 int,
    col3 string
)
PARTITIONED BY (col1 date) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
)
STORED AS PARQUET
LOCATION 's3://path_to_parquet'
TBLPROPERTIES ('parquet.compress'='SNAPPY');
Result:
col_1   col_2   col_3
5/3/19  NULL    NULL
5/4/19  NULL    NULL
5/5/19  NULL    NULL
5/6/19  NULL    NULL
Expected Result:
col_1   col_2   col_3
5/3/19  1234    ABC
5/4/19  124     FHK
5/5/19  12515   UPO
5/6/19  1234    ABC

编写以下答案，假设该表是使用蜂巢创建并使用spark读取的(因为该问题用 apache-spark-sql标记了(

数据是如何创建的？

Spark支持案例敏感的模式。当我们使用DataFrame API时，可以使用案例敏感架构编写。
示例：

scala> case class Employee(iD: Int, NaMe: String )
defined class Employee
scala> val df =spark.range(10).map(x => Employee(x.toInt, s"name$x")).write.save("file:///tmp/data/")
scala> spark.read.parquet("file:///tmp/data/").printSchema
root
 |-- iD: integer (nullable = true)
 |-- NaMe: string (nullable = true)

请注意，在上面的示例中保留了敏感性。
当我们在Spark创建的数据的顶部创建一个蜂巢表时，Hive将能够正确读取它，因为它不敏感。
当使用SPARK读取相同的数据时，它使用默认情况下的hive的模式，而返回的行是null。
为了克服这一点，Spark引入了Config spark.sql.hive.caseSensitiveInferenceMode。

object HiveCaseSensitiveInferenceMode extends Enumeration {
  val INFER_AND_SAVE, INFER_ONLY, NEVER_INFER = Value
}
val HIVE_CASE_SENSITIVE_INFERENCE = buildConf("spark.sql.hive.caseSensitiveInferenceMode")
  .doc("Sets the action to take when a case-sensitive schema cannot be read from a Hive " +
    "table's properties. Although Spark SQL itself is not case-sensitive, Hive compatible file " +
    "formats such as Parquet are. Spark SQL must use a case-preserving schema when querying " +
    "any table backed by files containing case-sensitive field names or queries may not return " +
    "accurate results. Valid options include INFER_AND_SAVE (the default mode-- infer the " +
    "case-sensitive schema from the underlying data files and write it back to the table " +
    "properties), INFER_ONLY (infer the schema but don't attempt to write it to the table " +
    "properties) and NEVER_INFER (fallback to using the case-insensitive metastore schema " +
    "instead of inferring).")
  .stringConf
  .transform(_.toUpperCase(Locale.ROOT))
  .checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString))
  .createWithDefault(HiveCaseSensitiveInferenceMode.INFER_AND_SAVE.toString)

INFER_AND_SAVE-作为表的TBLEPROPERTIES的一部分(desc extended <table name>应该揭示这一点(，在Metastore中播放模式并存储在Metastore中如果该属性的值为不是 INFER_AND_SAVE或INFER_ONLY，则Spark使用Metastore表中的架构，并且将无法读取Parquet文件。属性的默认值是INFER_AND_SAVE，因为Spark 2.2.0。

我们可以检查以下内容以查看问题是否与模式灵敏度有关：
1. spark.sql.hive.caseSensitiveInferenceMode的值(spark.sql("set spark.sql.hive.caseSensitiveInferenceMode")应该显示此(
2.如果使用Spark创建的数据
3.如果2为true，请检查模式是否敏感(spark.read(<location>).printSchema(4.如果3使用案例敏感的架构，而从1的输出不是INFER_AND_SAVE/INFER_ONLY，则设置spark.sql("set spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE")，丢弃表，重新创建表并尝试从Spark中读取数据。

数据是如何创建的？

相关内容

最新更新

热门标签：