How can I flatten this array into a DataFrame with columns [a,b,c,d,e]?
root
|-- array: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: string (nullable = true)
| | |-- b: long (nullable = true)
| | |-- c: string (nullable = true)
| | |-- d: string (nullable = true)
| | |-- e: long (nullable = true)
Any help is appreciated.
Suppose you have a JSON file with the following structure:
{
"array": [
{
"a": "asdf",
"b": 1234,
"c": "a",
"d": "str",
"e": 1234
},
{
"a": "asdf",
"b": 1234,
"c": "a",
"d": "str",
"e": 1234
},
{
"a": "asdf",
"b": 1234,
"c": "a",
"d": "str",
"e": 1234
}
]
}
- Read the file
scala> val nested = spark.read.option("multiline",true).json("nested.json")
nested: org.apache.spark.sql.DataFrame = [array: array<struct<a:string,b:bigint,c:string,d:string,e:bigint>>]
- Check the schema
scala> nested.printSchema
root
|-- array: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: string (nullable = true)
| | |-- b: long (nullable = true)
| | |-- c: string (nullable = true)
| | |-- d: string (nullable = true)
| | |-- e: long (nullable = true)
- Use the explode function to turn each array element into its own row, then expand the struct fields into columns with exploded.*
scala> nested.select(explode($"array").as("exploded")).select("exploded.*").show
+----+----+---+---+----+
| a| b| c| d| e|
+----+----+---+---+----+
|asdf|1234| a|str|1234|
|asdf|1234| a|str|1234|
|asdf|1234| a|str|1234|
+----+----+---+---+----+
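As a variant (my own addition, not part of the original answer), Spark's SQL generator function inline does the explode and the star-expansion in a single step. It is only exposed through SQL expressions in older Spark versions, so the sketch below uses selectExpr:

```scala
// inline() explodes an array of structs and expands each
// struct field into a top-level column in one call.
scala> nested.selectExpr("inline(array)").show
+----+----+---+---+----+
|   a|   b|  c|  d|   e|
+----+----+---+---+----+
|asdf|1234|  a|str|1234|
|asdf|1234|  a|str|1234|
|asdf|1234|  a|str|1234|
+----+----+---+---+----+
```

If some rows may have a null or empty array and you want to keep them, use explode_outer instead of explode; inline drops such rows.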