How to extract only certain attribute levels from a nested structure in a Spark DataFrame

We want to use Spark & Scala to break a nested data structure down into separate entities. The structure looks like this:

root
|-- timestamp: string (nullable = true)
|-- contract: struct (nullable = true)
|    |-- category: string (nullable = true)
|    |-- contractId: array (nullable = true)
|    |-- items: array (nullable = true)
|    |    |-- element: struct (containsNull = true)
|    |    |    |-- active: boolean (nullable = true)
|    |    |    |-- itemId: string (nullable = true)
|    |    |    |-- subItems: array (nullable = true)
|    |    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |    |-- elementId: string (nullable = true)

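We would like to have contracts, items and subItems in separate data collections. The sub-entities should contain a reference to their parent, and should carry the top-level field (timestamp) as an audit field.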

Contracts:

audit timestamp
category
contractId

Items:

audit timestamp
contractId (foreign key)
active
itemId

SubItems:

audit timestamp
itemId (foreign key)
elementId

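We do not want to configure every required attribute explicitly. We only want to configure the respective parent attribute to extract, the foreign key (reference), and what should not be extracted (e.g. contracts should not contain items, and items should not contain subItems).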

We tried things like dataframe.select("*").select(explode("contract.*")), but could not get it to work. Any ideas on how to do this elegantly are welcome.

Best, Alex

This is about how to flatten a row. The explode function should be applied to an array.

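import org.apache.spark.sql.functions.{col, explode}

// explode takes the array column itself; "contract.items.*" is not a valid argument
dataframe
  .select(explode(col("contract.items")).alias("ci_flat"))
  .select("ci_flat.itemId", "ci_flat.subItems")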
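
Following the point about explode, the three entities from the question could be derived roughly as shown here. This is a minimal sketch rather than a tested solution for the real data: it assumes Spark 3.1 or later (for Column.dropFields), it models contractId as a plain string, and the case classes, the SplitEntities object and the sample row are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical case classes mirroring the schema in the question.
// contractId is modelled as a String here purely for illustration.
case class SubItem(elementId: String)
case class Item(active: Boolean, itemId: String, subItems: Seq[SubItem])
case class Contract(category: String, contractId: String, items: Seq[Item])
case class Record(timestamp: String, contract: Contract)

object SplitEntities {
  // Contracts: keep every contract attribute except the nested items.
  def contracts(df: DataFrame): DataFrame =
    df.select(col("timestamp"), col("contract").dropFields("items").alias("c"))
      .select("timestamp", "c.*")

  // Items: one row per array element, with the parent's contractId as foreign key.
  def items(df: DataFrame): DataFrame =
    df.select(col("timestamp"), col("contract.contractId"),
        explode(col("contract.items")).alias("i"))
      .select(col("timestamp"), col("contractId"),
        col("i").dropFields("subItems").alias("i"))
      .select("timestamp", "contractId", "i.*")

  // SubItems: explode twice, keeping the parent's itemId as foreign key.
  def subItems(df: DataFrame): DataFrame =
    df.select(col("timestamp"), explode(col("contract.items")).alias("i"))
      .select(col("timestamp"), col("i.itemId"),
        explode(col("i.subItems")).alias("s"))
      .select("timestamp", "itemId", "s.*")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("split-entities").getOrCreate()
    import spark.implicits._

    // One made-up input row with one item and two sub-items.
    val df = Seq(
      Record("2020-01-01",
        Contract("A", "c1",
          Seq(Item(active = true, "i1", Seq(SubItem("e1"), SubItem("e2"))))))
    ).toDF()

    contracts(df).show()
    items(df).show()
    subItems(df).show()
    spark.stop()
  }
}
```

Note that explode drops rows whose array is null or empty; explode_outer keeps such rows with a null in the exploded column, if that is what is needed.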

Ref: Flattening rows in Spark; What is the difference between the explode function and operator?; How to unwind an array in a DataFrame (from JSON)?
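
To avoid listing every attribute, an explode-based approach can be driven by a small configuration that names only the arrays to explode, the parent field to keep as a foreign key, and the fields to exclude. The sketch below is one possible shape for that. EntitySpec and EntityExtractor are hypothetical names, the JSON row is made-up sample data, and the code assumes Spark 3.1 or later (for Column.dropFields) and that the foreign key name does not clash with a field of the child entity.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical configuration for one target entity.
final case class EntitySpec(
  root: String,               // struct column to start from, e.g. "contract"
  arrays: Seq[String],        // nested arrays to explode, outermost first
  foreignKey: Option[String], // field of the direct parent to keep
  exclude: Seq[String])       // fields of the entity itself to drop

object EntityExtractor {
  def extract(df: DataFrame, audit: String, spec: EntitySpec): DataFrame = {
    // "cur" always holds the struct we are currently positioned on.
    val start = df.select(col(audit), col(spec.root).alias("cur"))

    val exploded = spec.arrays.zipWithIndex.foldLeft(start) {
      case (acc, (field, idx)) =>
        val isLast = idx == spec.arrays.size - 1
        // Only the direct parent's key is carried along as the foreign key.
        val fk: Seq[Column] =
          if (isLast) spec.foreignKey.map(k => col(s"cur.$k").alias(k)).toSeq
          else Seq.empty
        val cols = (Seq(col(audit)) ++ fk) :+ explode(col(s"cur.$field")).alias("cur")
        acc.select(cols: _*)
    }

    // Drop the excluded nested fields, then expand the remaining struct fields.
    val entity =
      if (spec.exclude.isEmpty) col("cur") else col("cur").dropFields(spec.exclude: _*)
    val kept: Seq[Column] = exploded.columns.filterNot(_ == "cur").map(col).toSeq
    exploded
      .select(kept :+ entity.alias("cur"): _*)
      .select(kept :+ col("cur.*"): _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("entity-extractor").getOrCreate()
    import spark.implicits._

    // Made-up sample row shaped like the schema in the question.
    val json = """{"timestamp":"2020-01-01","contract":{"category":"A","contractId":"c1","items":[{"active":true,"itemId":"i1","subItems":[{"elementId":"e1"},{"elementId":"e2"}]}]}}"""
    val df = spark.read.json(Seq(json).toDS())

    val specs = Map(
      "contracts" -> EntitySpec("contract", Seq.empty, None, Seq("items")),
      "items"     -> EntitySpec("contract", Seq("items"), Some("contractId"), Seq("subItems")),
      "subItems"  -> EntitySpec("contract", Seq("items", "subItems"), Some("itemId"), Seq.empty))

    specs.foreach { case (name, spec) =>
      println(name)
      extract(df, "timestamp", spec).show()
    }
    spark.stop()
  }
}
```

With this layout, adding a new attribute to the source data needs no change to the configuration; only a new nesting level or a new exclusion does.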
