Spark dataframe reduceByKey

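I am using Spark 1.5/1.6 and want to perform a reduceByKey operation on a DataFrame without converting the df to an rdd.

Each row looks like the following, and there are multiple rows per id1: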

id1, id2, score, time

I would like to end up with something like:

id1, [ (id21, score21, time21), (id22, score22, time22), (id23, score23, time23) ]

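So, for each "id1", I want all of its records in a list.

By the way, the reason I don't want to convert the df to an rdd is that I have to join this (reduced) dataframe to another dataframe, and I am repartitioning on the join key, which makes the join faster. I guess the same cannot be done with an rdd.

Any help will be appreciated.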

To simply preserve the partitioning already achieved, reuse the parent RDD's partitioner in the reduceByKey call:

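 import org.apache.spark.Partitioner
 // Assumptions: id1 is a String in column 0; reduceFn: (Row, Row) => Row is defined elsewhere
 val rdd = df.rdd.keyBy(_.getString(0))
 val parentRdd = rdd.dependencies(0).rdd // Assuming first parent has the
                                         // desired partitioning: adjust as needed
 // partitioner is an Option; fall back to the default if the parent has none
 val parentPartitioner = parentRdd.partitioner.getOrElse(Partitioner.defaultPartitioner(rdd))
 val optimizedReducedRdd = rdd.reduceByKey(parentPartitioner, reduceFn)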
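As a rough illustration of the idea above, here is a self-contained sketch that builds the per-id1 list from the question while reusing one partitioner. The names `ReduceByKeySketch` and `collectPerKey`, the sample rows, the column types, and the choice of `HashPartitioner(4)` are all assumptions made for the example; they do not come from the question.

```scala
import org.apache.spark.{HashPartitioner, Partitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ReduceByKeySketch {
  // (id2, score, time); the concrete types are assumed for illustration
  type Record = (String, Double, Long)

  // Collect all records per id1 while keeping the supplied partitioning.
  def collectPerKey(rows: RDD[(String, Record)],
                    partitioner: Partitioner): RDD[(String, List[Record])] =
    rows
      .mapValues(r => List(r))            // wrap each record so values can be concatenated
      .partitionBy(partitioner)           // one shuffle, keyed by id1
      .reduceByKey(partitioner, _ ::: _)  // same partitioner, so no second shuffle

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("reduceByKey-sketch").setMaster("local[2]"))

    // Hypothetical sample data: (id1, (id2, score, time))
    val rows = sc.parallelize(Seq(
      ("a", ("x", 1.0, 10L)),
      ("a", ("y", 2.0, 20L)),
      ("b", ("z", 3.0, 30L))))

    val partitioner = new HashPartitioner(4)
    val reduced = collectPerKey(rows, partitioner)

    // The result still carries the partitioner that was passed in.
    assert(reduced.partitioner == Some(partitioner))
    reduced.collect().foreach(println)
    sc.stop()
  }
}
```

Because the result keeps its partitioner, a later join against another pair RDD partitioned the same way can avoid reshuffling this side. Note that concatenating lists in the reduce function moves as much data as groupByKey would, so this is about preserving the layout rather than shrinking the data.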

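If you instead do not specify the partitioner, as follows:

 // id1 is assumed to be a String in column 0
 df.rdd.keyBy(_.getString(0)).reduceByKey(reduceFn)  // This is non-optimized: uses full shuffle

then the behavior you noted occurs, i.e. a full shuffle takes place. That is because the default partitioner is used instead of the parent's.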
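One way to see the difference is to inspect the dependencies of the resulting RDD. The sketch below is only an illustration: the object `ShuffleCheck`, the helper `shuffles`, and the sample pairs are hypothetical. It compares a reduceByKey on an RDD that carries no partitioner with one on an RDD that was already partitioned by the same partitioner.

```scala
import org.apache.spark.{HashPartitioner, ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ShuffleCheck {
  // True when the RDD is linked to its immediate parent through a shuffle.
  def shuffles(rdd: RDD[_]): Boolean =
    rdd.dependencies.exists(_.isInstanceOf[ShuffleDependency[_, _, _]])

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("shuffle-check").setMaster("local[2]"))

    // Hypothetical (id1, score) pairs; an RDD from parallelize has no partitioner.
    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

    // No partitioner is known for the input, so reduceByKey has to shuffle.
    val plain = pairs.reduceByKey(_ + _)

    // Partition once, then reduce with the same partitioner.
    val partitioner = new HashPartitioner(4)
    val reused = pairs.partitionBy(partitioner).reduceByKey(partitioner, _ + _)

    println(s"plain shuffles:  ${shuffles(plain)}")
    println(s"reused shuffles: ${shuffles(reused)}")
    sc.stop()
  }
}
```

The first reduce is expected to report a shuffle dependency and the second a narrow one, since the input of the second is already laid out by the partitioner that reduceByKey receives.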
