How to convert a flat DataFrame into a Dataset of Tuple2<S, T> in Spark



I am using Spark 2.0.

I have a DataFrame obtained by joining two DataFrames, which were converted from a JavaRDD<ORD> and a JavaRDD<Buddy> respectively, using Encoders.
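A minimal sketch of that conversion, assuming ORD and Buddy are plain Java bean classes and using placeholder names for the SparkSession and the two RDDs:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Convert each JavaRDD to a typed Dataset with a bean Encoder,
// then drop down to an untyped DataFrame (Dataset<Row>) for the join.
// "spark", "ordRdd" and "buddyRdd" are placeholders for my session and input RDDs.
Dataset<Row> ordDf = spark
        .createDataset(ordRdd.rdd(), Encoders.bean(ORD.class))
        .toDF();
Dataset<Row> buddyDf = spark
        .createDataset(buddyRdd.rdd(), Encoders.bean(Buddy.class))
        .toDF();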

The schema of the Dataset<Row> converted from JavaRDD<ORD> is schema1:
root
|-- buddyPathAverageSignalList: array (nullable = true)
|       |-- element: struct (containsNull = true)
|       |       |-- averageSignal: float (nullable = false)
|       |       |-- buddyId: integer (nullable = true)
|-- latitude: double (nullable = true)
|-- longitude: double (nullable = true)
|-- repId: integer (nullable = false)

The schema of Dataset<Row> converted from JavaRDD<Buddy> is schema2:
root
|-- buddyId: integer (nullable = false)
|-- buddyToTgbSignal: float (nullable = true)
|-- latitude: double (nullable = true)
|-- longitude: double (nullable = true)
|-- ordId: integer (nullable = false)
|-- ordToBuddySignal: float (nullable = true)
|-- ringStep: integer (nullable = false)
The schema of the joined DataFrame is schema3; it is flat:
root
|-- buddyId: integer (nullable = false)
|-- buddyToTgbSignal: float (nullable = true)
|-- latitude: double (nullable = true)
|-- longitude: double (nullable = true)
|-- ordId: integer (nullable = false)
|-- ordToBuddySignal: float (nullable = true)
|-- ringStep: integer (nullable = false)
|-- buddyPathAverageSignalList: array (nullable = true)
|       |-- element: struct (containsNull = true)
|       |       |-- averageSignal: float (nullable = false)
|       |       |-- buddyId: integer (nullable = true)
|-- latitude: double (nullable = true)
|-- longitude: double (nullable = true)
|-- repId: integer (nullable = false)

Now I want to convert the joined flat DataFrame above (with schema3) into a JavaRDD<Tuple2<ORD, Buddy>> or a Dataset<Tuple2<ORD, Buddy>>.

I could not find a good way to do this. I was hoping to use Encoders for the ORD, Buddy, and/or Tuple2 classes so that I would not have to parse every field one by one, but I have not succeeded.

A Dataset<Tuple2<ORDType, BuddyType>> has a schema of the following nested form:

Schema of Dataset<Tuple2<ORDType, BuddyType>> (schema4):
root
|-- _1: struct (nullable = false)
|       |-- buddyPathAverageSignalList: array (nullable = true)
|       |       |-- element: struct (containsNull = false)
|       |       |       |-- averageSignal: float (nullable = true)
|       |       |       |-- buddyId: integer (nullable = true)
|       |-- latitude: double (nullable = true)
|       |-- longitude: double (nullable = true)
|       |-- repId: integer (nullable = true)
|-- _2: struct (nullable = false)
|       |-- buddyId: integer (nullable = true)
|       |-- buddyToTgbSignal: float (nullable = true)
|       |-- latitude: double (nullable = true)
|       |-- longitude: double (nullable = true)
|       |-- ordId: integer (nullable = true)
|       |-- ordToBuddySignal: float (nullable = true)
|       |-- ringStep: integer (nullable = true)

Is there a way to transform the joined flat DataFrame with schema3 into the nested schema4, so that I can then convert it to a Dataset<Tuple2<ORD, Buddy>>?
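Conceptually, I am after something like the following untested sketch, which rebuilds the two nested structs with struct() and then applies a tuple encoder. Note that the duplicated latitude/longitude columns in schema3 would need to be aliased first (e.g. before the join); the ordLatitude/buddyLatitude names below are made up for illustration, and I am not sure the bean encoders actually resolve against structs built this way, which is essentially my question:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.struct;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import scala.Tuple2;

// A tuple encoder built from the two bean encoders.
Encoder<Tuple2<ORD, Buddy>> tupleEncoder =
        Encoders.tuple(Encoders.bean(ORD.class), Encoders.bean(Buddy.class));

// Rebuild the nested "_1"/"_2" structs from the flat columns of the joined
// DataFrame ("joinedDf"), then cast the result to the typed Dataset.
// "ordLatitude"/"buddyLatitude" etc. are hypothetical pre-aliased column names,
// since "latitude"/"longitude" appear twice in schema3 and would be ambiguous.
Dataset<Tuple2<ORD, Buddy>> typed = joinedDf
        .select(
                struct(
                        col("buddyPathAverageSignalList"),
                        col("ordLatitude").as("latitude"),
                        col("ordLongitude").as("longitude"),
                        col("repId")
                ).as("_1"),
                struct(
                        col("buddyId"),
                        col("buddyToTgbSignal"),
                        col("buddyLatitude").as("latitude"),
                        col("buddyLongitude").as("longitude"),
                        col("ordId"),
                        col("ordToBuddySignal"),
                        col("ringStep")
                ).as("_2"))
        .as(tupleEncoder);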

Any other suggestions? I know I could join the Datasets with each other directly, but I want to test the DataFrame approach, i.e. Dataset<Row>, because it supposedly performs better.

A DataFrame is just a Dataset[Row] in Spark 2.0. Since DataFrames use the same execution engine as Datasets, their performance should be the same.

As for your actual question - I cannot tell from your example what the join key is. Is it latitude/longitude? If so, you should be able to use data1.joinWith(data2, data1("latitude") === data2("latitude") && data1("longitude") === data2("longitude")) (not sure about the exact syntax).
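In the Java API that would look roughly like this (an untested sketch, assuming ords and buddies are the typed Datasets built with Encoders.bean, and that latitude/longitude really is the join key):

import org.apache.spark.sql.Dataset;
import scala.Tuple2;

// joinWith keeps both sides as typed objects, so the result is already
// a Dataset<Tuple2<ORD, Buddy>> and no flat-to-nested conversion is needed afterwards.
Dataset<Tuple2<ORD, Buddy>> joined = ords.joinWith(
        buddies,
        ords.col("latitude").equalTo(buddies.col("latitude"))
            .and(ords.col("longitude").equalTo(buddies.col("longitude"))));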
