How can I flatten this array into a DataFrame with columns [a,b,c,d,e]?
root
|-- array: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: string (nullable = true)
| | |-- b: long (nullable = true)
| | |-- c: string (nullable = true)
| | |-- d: string (nullable = true)
| | |-- e: long (nullable = true)
Any help is appreciated.
Suppose you have a JSON file with the following structure:
{
"array": [
{
"a": "asdf",
"b": 1234,
"c": "a",
"d": "str",
"e": 1234
},
{
"a": "asdf",
"b": 1234,
"c": "a",
"d": "str",
"e": 1234
},
{
"a": "asdf",
"b": 1234,
"c": "a",
"d": "str",
"e": 1234
}
]
}
- Read the file
scala> val nested = spark.read.option("multiline",true).json("nested.json")
nested: org.apache.spark.sql.DataFrame = [array: array<struct<a:string,b:bigint,c:string,d:string,e:bigint>>]
- Check the schema
scala> nested.printSchema
root
|-- array: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: string (nullable = true)
| | |-- b: long (nullable = true)
| | |-- c: string (nullable = true)
| | |-- d: string (nullable = true)
| | |-- e: long (nullable = true)
- Use the explode function to turn each array element into its own row, then expand the struct fields into columns with exploded.*
scala> nested.select(explode($"array").as("exploded")).select("exploded.*").show
+----+----+---+---+----+
| a| b| c| d| e|
+----+----+---+---+----+
|asdf|1234| a|str|1234|
|asdf|1234| a|str|1234|
|asdf|1234| a|str|1234|
+----+----+---+---+----+
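As a variant (my own addition, not part of the original answer), Spark's SQL generator function inline does the explode and the star-expansion in a single step. It is only exposed through SQL expressions in older Spark versions, so the sketch below uses selectExpr:

```scala
// inline() explodes an array of structs and expands each
// struct field into a top-level column in one call.
scala> nested.selectExpr("inline(array)").show
+----+----+---+---+----+
|   a|   b|  c|  d|   e|
+----+----+---+---+----+
|asdf|1234|  a|str|1234|
|asdf|1234|  a|str|1234|
|asdf|1234|  a|str|1234|
+----+----+---+---+----+
```

If some rows may have a null or empty array and you want to keep them, use explode_outer instead of explode; inline drops such rows.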