What is the status of schema evolution for arrays of structs (complex types) in Spark?
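I know that schema evolution (adding new columns) works well for regular simple types with either ORC or Parquet, but so far I could not find any documentation for my case.
My use case is a structure similar to this one: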
user_id,date,[{event_time, foo, bar, baz, tag1, tag2, ... future_tag_n}, ...]
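I want to be able to add new fields to the struct inside the array.
Would a Map (key-value) complex type be inefficient instead? With a map I could at least be sure that adding new fields (tags) stays flexible.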
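To make the map alternative concrete, here is a minimal sketch in plain Scala. The field types and the names `Event` and `UserEvents` are assumptions for illustration (the post only lists field names). It does not answer the efficiency question; it only shows that a new tag needs no schema change when tags live in a map. One visible trade-off is that all tag values must share a single type.

```scala
// Hypothetical types: the post names the fields but not their types.
case class Event(eventTime: Long, foo: String, bar: String, baz: Int, tags: Map[String, String])
case class UserEvents(userId: Long, date: String, events: Seq[Event])

// An "old" event with one tag and a "new" event carrying an additional tag.
val oldEvent = Event(1L, "a", "b", 1, Map("tag1" -> "x"))
val newEvent = Event(2L, "a", "b", 1, Map("tag1" -> "x", "future_tag_n" -> "y"))
val row = UserEvents(42L, "2019-01-01", Seq(oldEvent, newEvent))

// Looking up a tag that older events lack yields None instead of a schema mismatch.
val lookups = row.events.map(_.tags.get("future_tag_n"))
```

In Spark, a `Map[String, String]` field in a case class is encoded as a `map<string,string>` column, so both events above share one schema.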
Edit
case class BarFirst(baz:Int, foo:String)
case class BarSecond(baz:Int, foo:String, moreColumns:Int, oneMore:String)
case class BarSecondNullable(baz:Int, foo:String, moreColumns:Option[Int], oneMore:Option[String])
case class Foo(i:Int, date:String, events:Seq[BarFirst])
case class FooSecond(i:Int, date:String, events:Seq[BarSecond])
case class FooSecondNullable(i:Int, date:String, events:Seq[BarSecondNullable])
val dfInitial = Seq(Foo(1, "2019-01-01", Seq(BarFirst(1, "asdf")))).toDF
dfInitial.printSchema
dfInitial.show
root
|-- i: integer (nullable = false)
|-- date: string (nullable = true)
|-- events: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- baz: integer (nullable = false)
| | |-- foo: string (nullable = true)
scala> dfInitial.show
+---+----------+----------+
| i| date| events|
+---+----------+----------+
| 1|2019-01-01|[[1,asdf]]|
+---+----------+----------+
dfInitial.write.partitionBy("date").parquet("my_df.parquet")
tree my_df.parquet
my_df.parquet
├── _SUCCESS
└── date=2019-01-01
    └── part-00000-fd77f730-6539-4b51-b680-b7dd5ffc04f4.c000.snappy.parquet
val evolved = Seq(FooSecond(2, "2019-01-02", Seq(BarSecond(1, "asdf", 11, "oneMore")))).toDF
evolved.printSchema
evolved.show
scala> evolved.printSchema
root
|-- i: integer (nullable = false)
|-- date: string (nullable = true)
|-- events: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- baz: integer (nullable = false)
| | |-- foo: string (nullable = true)
| | |-- moreColumns: integer (nullable = false)
| | |-- oneMore: string (nullable = true)
scala> evolved.show
+---+----------+--------------------+
| i| date| events|
+---+----------+--------------------+
| 1|2019-01-02|[[1,asdf,11,oneMo...|
+---+----------+--------------------+
import org.apache.spark.sql._
evolved.write.mode(SaveMode.Append).partitionBy("date").parquet("my_df.parquet")
my_df.parquet
├── _SUCCESS
├── date=2019-01-01
│   └── part-00000-fd77f730-6539-4b51-b680-b7dd5ffc04f4.c000.snappy.parquet
└── date=2019-01-02
    └── part-00000-64e65d05-3f33-430e-af66-f1f82c23c155.c000.snappy.parquet
val df = spark.read.parquet("my_df.parquet")
df.printSchema
scala> df.printSchema
root
|-- i: integer (nullable = true)
|-- events: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- baz: integer (nullable = true)
| | |-- foo: string (nullable = true)
|-- date: date (nullable = true)
The additional columns are missing! Why?
df.show
df.as[FooSecond].collect // AnalysisException: No such struct field moreColumns in baz, foo
df.as[FooSecondNullable].collect // AnalysisException: No such struct field moreColumns in baz, foo
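This behavior was observed with Spark 2.2.3_2.11 and 2.4.2_2.12.
When the code from the edit above was executed, schema merging was turned off, so the new columns were not loaded. With schema merging enabled: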
val df = spark.read.option("mergeSchema", "true").parquet("my_df.parquet")
scala> df.printSchema
root
|-- i: integer (nullable = true)
|-- events: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- baz: integer (nullable = true)
| | |-- foo: string (nullable = true)
| | |-- moreColumns: integer (nullable = true)
| | |-- oneMore: string (nullable = true)
|-- date: date (nullable = true)
df.as[FooSecond].collect // obviously fails with a NullPointerException; Option must be used
df.as[FooSecondNullable].collect // works fine
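Besides the per-read option, Spark also has a session-wide setting, `spark.sql.parquet.mergeSchema`. The sketch below is only an illustration: the application name and the local master are assumptions, and the setting carries the same merge cost on every Parquet read in the session.

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session for illustration; in spark-shell a session already exists.
val spark = SparkSession.builder()
  .appName("merge-schema-sketch")
  .master("local[*]")
  // Session-wide equivalent of .option("mergeSchema", "true") on each Parquet read.
  .config("spark.sql.parquet.mergeSchema", "true")
  .getOrCreate()

val mergeEnabled = spark.conf.get("spark.sql.parquet.mergeSchema")
```

With this set, a plain `spark.read.parquet("my_df.parquet")` should return the merged schema without the explicit option.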
Now using Hive
evolved.write.mode(SaveMode.Append).partitionBy("date").saveAsTable("my_df")
This seems to work fine (no exception), but when trying to read the data back:
spark.sql("describe my_df").show(false)
+-----------------------+---------------------------------+-------+
|col_name |data_type |comment|
+-----------------------+---------------------------------+-------+
|i |int |null |
|events |array<struct<baz:int,foo:string>>|null |
|date |string |null |
|# Partition Information| | |
|# col_name |data_type |comment|
|date |string |null |
+-----------------------+---------------------------------+-------+
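Since the append raises no error while the table schema above still lacks `moreColumns` and `oneMore`, it may help to compare schemas before writing. The following is a sketch, not part of the original post: `fieldPaths` and `tableOf` are hypothetical helpers, and the schemas are built by hand so the block runs without a table. In practice the two inputs could come from `spark.table("my_df").schema` and `evolved.schema`.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: collect dotted paths of all (nested) struct fields.
def fieldPaths(dt: DataType, prefix: String = ""): Set[String] = dt match {
  case s: StructType =>
    s.fields.toSeq.flatMap { f =>
      val path = if (prefix.isEmpty) f.name else s"$prefix.${f.name}"
      fieldPaths(f.dataType, path) + path
    }.toSet
  case a: ArrayType => fieldPaths(a.elementType, prefix) // descend into array elements
  case m: MapType   => fieldPaths(m.valueType, prefix)   // descend into map values
  case _            => Set.empty[String]
}

// Hand-built schemas mirroring the old and the evolved event struct from the post.
val oldEvent = StructType(Seq(StructField("baz", IntegerType), StructField("foo", StringType)))
val newEvent = oldEvent.add("moreColumns", IntegerType).add("oneMore", StringType)

def tableOf(event: StructType): StructType = StructType(Seq(
  StructField("i", IntegerType),
  StructField("events", ArrayType(event)),
  StructField("date", StringType)))

// Fields present in the data to be written but absent from the existing table.
val wouldBeDropped = fieldPaths(tableOf(newEvent)) -- fieldPaths(tableOf(oldEvent))
```

A non-empty `wouldBeDropped` set signals that the table definition needs to be adjusted before the append.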
When using only primitive types instead of an array of structs:
// Field names inferred from the error message below; Foo is redefined for this example
case class Foo(first:Int, dt:String)
case class FooEvolved(first:Int, second:Int, dt:String)
val first = Seq(Foo(1, "2019-01-01")).toDF
first.printSchema
first.write.partitionBy("dt").saveAsTable("df")
val evolved = Seq(FooEvolved(1,2, "2019-01-02")).toDF
evolved.printSchema
evolved.write.mode(SaveMode.Append).partitionBy("dt").saveAsTable("df")
org.apache.spark.sql.AnalysisException: The column number of the existing table default.df(struct<first:int,dt:string>) doesn't match the data schema(struct<first:int,second:int,dt:string>);
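In this case there is a clear error message.
Question: is it still possible to evolve the schema in Hive, or does the schema need to be adjusted manually?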
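Conclusion
Schema evolution for arrays of structs is supported, but the merge option must be turned on when reading the files, and it seems to work out of the box only when the files are read directly, without Hive.
When reading from Hive, only the old schema is returned, because the new columns seem to be dropped silently on write.
Handling schema evolution for the Parquet format by manually creating a view looks like an interesting alternative, because the mergeSchema option set to true is quite resource-intensive, and a view works for all SQL engines on Hadoop. As an additional benefit, a view also allows schema changes that Parquet itself does not support, such as renames and data type changes.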
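Another way to avoid the merge cost when the target schema is already known is to pass it explicitly on read. This is a sketch under assumptions and was not tried in the original post: it reuses the case classes from the edit above, writes to a temporary directory, and relies on Spark filling nested fields that are missing from older files with null, as it does when schema merging is enabled.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Encoders, SaveMode, SparkSession}

case class BarFirst(baz: Int, foo: String)
case class BarSecond(baz: Int, foo: String, moreColumns: Int, oneMore: String)
case class BarSecondNullable(baz: Int, foo: String, moreColumns: Option[Int], oneMore: Option[String])
case class Foo(i: Int, date: String, events: Seq[BarFirst])
case class FooSecond(i: Int, date: String, events: Seq[BarSecond])
case class FooSecondNullable(i: Int, date: String, events: Seq[BarSecondNullable])

// Assumed local session for illustration.
val spark = SparkSession.builder().appName("explicit-schema-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Temporary location standing in for "my_df.parquet" from the post.
val path = Files.createTempDirectory("my_df").resolve("my_df.parquet").toString

// Write the old and the evolved data into separate date partitions.
Seq(Foo(1, "2019-01-01", Seq(BarFirst(1, "asdf")))).toDF
  .write.partitionBy("date").parquet(path)
Seq(FooSecond(2, "2019-01-02", Seq(BarSecond(1, "asdf", 11, "oneMore")))).toDF
  .write.mode(SaveMode.Append).partitionBy("date").parquet(path)

// Derive the target schema from the nullable case class and use it instead of mergeSchema.
val targetSchema = Encoders.product[FooSecondNullable].schema
val rows = spark.read.schema(targetSchema).parquet(path)
  .as[FooSecondNullable].collect().sortBy(_.i)
```

The design choice here is that the reader no longer has to inspect file footers to discover the schema, at the price of keeping the case class (or an equivalent `StructType`) in sync with the data by hand.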