How to add a new field to a two-level nested struct column



I have a DataFrame with the following schema:

root
|-- ts: timestamp (nullable = true)
|-- address_list: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- id: string (nullable = true)
|    |    |-- active: integer (nullable = true)
|    |    |-- address: array (nullable = true)
|    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |-- street: string (nullable = true)
|    |    |    |    |-- city: long (nullable = true)
|    |    |    |    |-- state: integer (nullable = true)

I want to add a new field street_2 to the nested column address_list.address, between street and city.

This is the expected schema:

root
|-- ts: timestamp (nullable = true)
|-- address_list: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- id: string (nullable = true)
|    |    |-- active: integer (nullable = true)
|    |    |-- address: array (nullable = true)
|    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |-- street: string (nullable = true)
|    |    |    |    |-- street_2: string (nullable = true)
|    |    |    |    |-- city: long (nullable = true)
|    |    |    |    |-- state: integer (nullable = true)

I did try using transform, but it added the street_2 field at the end of each address_list element instead:

df
.withColumn("address_list", transform(col("address_list"), x => x.withField("street_2", lit(null).cast("string"))))
root
|-- ts: timestamp (nullable = true)
|-- address_list: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- id: string (nullable = true)
|    |    |-- active: integer (nullable = true)
|    |    |-- address: array (nullable = true)
|    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |-- street: string (nullable = true)
|    |    |    |    |-- city: long (nullable = true)
|    |    |    |    |-- state: integer (nullable = true)
|    |    |-- street_2: string (nullable = true)

I want it inside address, inserted between street and city.

You can try this:


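import org.apache.spark.sql.functions.{col, lit, transform}

// Schema before the change
data.printSchema
// withField accepts a dotted path, so it can reach into nested structs
val result = data.withColumn(
  "person_details",
  transform(col("person_details"), x => x.withField("person.details.age", lit(40))))
// Schema after the change
result.printSchema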
root
|-- person_details: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- person: struct (nullable = true)
|    |    |    |-- name: string (nullable = true)
|    |    |    |-- details: struct (nullable = true)
|    |    |    |    |-- city: string (nullable = true)
|    |    |    |    |-- income: long (nullable = false)
root
|-- person_details: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- person: struct (nullable = true)
|    |    |    |-- name: string (nullable = true)
|    |    |    |-- details: struct (nullable = true)
|    |    |    |    |-- city: string (nullable = true)
|    |    |    |    |-- income: long (nullable = false)
|    |    |    |    |-- age: integer (nullable = false)

I got help from this post: https://medium.com/@fqaiser94/manipulating-nested-data-just-got-easier-in-apache-spark-3-1-1-f88bc9003827
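For the schema in the question, address is itself an array, and as far as I know a dotted withField path only walks through struct fields, not arrays. So a second, nested transform is probably needed. The following is a minimal sketch, assuming Spark 3.1+ and the schema posted above; the names AddNestedField, emptyInput and addStreet2 are made up for illustration. Because withField appends new fields at the end, the inner struct is rebuilt field by field to control the order.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.{col, lit, struct, transform, when}
import org.apache.spark.sql.types._

object AddNestedField {

  // Empty DataFrame with the schema from the question (illustrative helper).
  def emptyInput(spark: SparkSession): DataFrame = {
    val addressType = new StructType()
      .add("street", StringType)
      .add("city", LongType)
      .add("state", IntegerType)
    val elementType = new StructType()
      .add("id", StringType)
      .add("active", IntegerType)
      .add("address", ArrayType(addressType))
    val schema = new StructType()
      .add("ts", TimestampType)
      .add("address_list", ArrayType(elementType))
    spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
  }

  // Inserts a null string field street_2 between street and city.
  def addStreet2(df: DataFrame): DataFrame =
    df.withColumn(
      "address_list",
      // Outer transform: visit each element of address_list.
      transform(col("address_list"), outer =>
        // Replacing the existing "address" field keeps its position.
        outer.withField(
          "address",
          // Inner transform: rebuild each address struct in the wanted order.
          transform(outer.getField("address"), addr =>
            // Keep null array elements null instead of a struct of nulls.
            when(addr.isNotNull,
              struct(
                addr.getField("street").as("street"),
                lit(null).cast("string").as("street_2"),
                addr.getField("city").as("city"),
                addr.getField("state").as("state")))))))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("add-nested-field")
      .getOrCreate()
    addStreet2(emptyInput(spark)).printSchema()
    spark.stop()
  }
}
```

The drawback of this approach is that every field of the inner struct has to be listed by hand, so it needs updating whenever the address struct changes. The nullable flags in the printed schema may also differ slightly from the original.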
