I am using Spark 2.0.
I have a DataFrame joined from two DataFrames, which were converted from a JavaRDD<ORD> and a JavaRDD<Buddy> respectively using an Encoder.
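The RDD-to-DataFrame conversion described above can be sketched as follows. This is a minimal sketch only: the ORD class here is a simplified stand-in with just three of the real fields, and the session settings are placeholders.

```java
import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class EncoderDemo {
    // Simplified stand-in bean; the real ORD carries all fields of schema1.
    public static class ORD implements Serializable {
        private int repId;
        private double latitude;
        private double longitude;
        public int getRepId() { return repId; }
        public void setRepId(int v) { repId = v; }
        public double getLatitude() { return latitude; }
        public void setLatitude(double v) { latitude = v; }
        public double getLongitude() { return longitude; }
        public void setLongitude(double v) { longitude = v; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[1]").appName("encoder-demo").getOrCreate();

        ORD ord = new ORD();
        ord.setRepId(1);
        ord.setLatitude(52.5);
        ord.setLongitude(13.4);

        // Encoders.bean derives the schema from the bean's getters, so no
        // field-by-field parsing is needed. For an existing JavaRDD<ORD> rdd,
        // spark.createDataset(rdd.rdd(), Encoders.bean(ORD.class)) works the
        // same way.
        Dataset<ORD> ds = spark.createDataset(
                Arrays.asList(ord), Encoders.bean(ORD.class));
        Dataset<Row> df = ds.toDF();   // a DataFrame is just Dataset<Row>
        df.printSchema();
        spark.stop();
    }
}
```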
The schema of the Dataset<Row> converted from JavaRDD<ORD> is schema1:
root
|-- buddyPathAverageSignalList: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- averageSignal: float (nullable = false)
| | |-- buddyId: integer (nullable = true)
|-- latitude: double (nullable = true)
|-- longitude: double (nullable = true)
|-- repId: integer (nullable = false)
The schema of the Dataset<Row> converted from JavaRDD<Buddy> is schema2:
root
|-- buddyId: integer (nullable = false)
|-- buddyToTgbSignal: float (nullable = true)
|-- latitude: double (nullable = true)
|-- longitude: double (nullable = true)
|-- ordId: integer (nullable = false)
|-- ordToBuddySignal: float (nullable = true)
|-- ringStep: integer (nullable = false)
The schema of the joined DataFrame is schema3; it is flat:
root
|-- buddyId: integer (nullable = false)
|-- buddyToTgbSignal: float (nullable = true)
|-- latitude: double (nullable = true)
|-- longitude: double (nullable = true)
|-- ordId: integer (nullable = false)
|-- ordToBuddySignal: float (nullable = true)
|-- ringStep: integer (nullable = false)
|-- buddyPathAverageSignalList: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- averageSignal: float (nullable = false)
| | |-- buddyId: integer (nullable = true)
|-- latitude: double (nullable = true)
|-- longitude: double (nullable = true)
|-- repId: integer (nullable = false)
Now I want to convert the joined flat DataFrame above (with schema3) into a JavaRDD<Tuple2<ORD, Buddy>> or a Dataset<Tuple2<ORD, Buddy>>, but I cannot find a good way to do it. I would like to use Encoders for the classes ORD, Buddy, and/or Tuple2, so that I do not have to parse every field one by one, but I have not succeeded.
A Dataset<Tuple2<ORDType, BuddyType>> has the following nested schema (schema4):
root
|-- _1: struct (nullable = false)
| |-- buddyPathAverageSignalList: array (nullable = true)
| | |-- element: struct (containsNull = false)
| | | |-- averageSignal: float (nullable = true)
| | | |-- buddyId: integer (nullable = true)
| |-- latitude: double (nullable = true)
| |-- longitude: double (nullable = true)
| |-- repId: integer (nullable = true)
|-- _2: struct (nullable = false)
| |-- buddyId: integer (nullable = true)
| |-- buddyToTgbSignal: float (nullable = true)
| |-- latitude: double (nullable = true)
| |-- longitude: double (nullable = true)
| |-- ordId: integer (nullable = true)
| |-- ordToBuddySignal: float (nullable = true)
| |-- ringStep: integer (nullable = true)
Is there a way to convert the joined flat DataFrame with schema3 into the nested schema4, so that I can then convert it to a Dataset<Tuple2<ORD, Buddy>>?
Any other suggestions? I know I could join the two typed Datasets directly, but I want to test the DataFrame (i.e. Dataset<Row>) approach, since it supposedly performs better.
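For concreteness, the transformation I am after would look something like the following. This is a hypothetical sketch only: `joinedDf` stands for the joined DataFrame with schema3, and the duplicated latitude/longitude columns produced by the join would first have to be disambiguated (e.g. by aliasing them before the join) for the column references to be unambiguous.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.struct;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import scala.Tuple2;

// Group the flat columns of schema3 back into two structs named _1/_2
// (matching schema4), then decode with a tuple encoder instead of
// parsing field by field.
Dataset<Tuple2<ORD, Buddy>> result = joinedDf.select(
        struct(col("buddyPathAverageSignalList"), col("latitude"),
               col("longitude"), col("repId")).as("_1"),
        struct(col("buddyId"), col("buddyToTgbSignal"), col("latitude"),
               col("longitude"), col("ordId"), col("ordToBuddySignal"),
               col("ringStep")).as("_2"))
    .as(Encoders.tuple(Encoders.bean(ORD.class), Encoders.bean(Buddy.class)));
```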
A DataFrame is just a Dataset[Row] in Spark 2.0. Since DataFrames use the same execution engine as Datasets, their performance should be the same.
As for your question: I could not find what the join key is in your example. Is it latitude/longitude? If so, you should be able to use data1.joinWith(data2, data1("latitude") === data2("latitude") && data1("longitude") === data2("longitude"))
(not sure about the exact syntax)
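In Java, the joinWith call above would look roughly like this. A sketch only, assuming `ords` and `buddies` are the two typed Datasets and that latitude/longitude really is the join key:

```java
import org.apache.spark.sql.Dataset;
import scala.Tuple2;

// joinWith keeps both sides as whole objects, so the result is already
// the nested Dataset<Tuple2<ORD, Buddy>> with a schema like schema4 --
// no flattening and re-nesting is needed.
Dataset<Tuple2<ORD, Buddy>> joined = ords.joinWith(
        buddies,
        ords.col("latitude").equalTo(buddies.col("latitude"))
            .and(ords.col("longitude").equalTo(buddies.col("longitude"))),
        "inner");
```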