I'm new to Apache Spark and need some advice. I have an RDD of type [String, Int]. The RDD values are as follows:
- ("A,x",3)
- ("A,y",4)
- ("A,z",1)
- ("B,y",2)
- ("C,w",5)
- ("C,y",2)
- ("E,x",1)
- ("E,z",3)
What I want to achieve is an RDD of (String, String) like this:
- ("A","y") // among the keys containing A, (A,y) has the largest value
- ("B","y") // among the keys containing B, (B,y) has the largest value
- ("C","w") // among the keys containing C, (C,w) has the largest value
- ("E","z") // among the keys containing E, (E,z) has the largest value
I tried a loop-style approach inside flatMap (using a counter), but it didn't work. Is there a simple way to do this?
Just reshape the pairs and use reduceByKey:
val pattern = "^(.*?),(.*?)$".r
rdd
  // Split the key into its two parts
  .flatMap { case (pattern(x, y), z) => Some((x, (y, z))) }
  // Reduce by the first part of the key, keeping the pair with the larger value
  .reduceByKey((a, b) => if (a._2 > b._2) a else b)
  // Go back to the original shape
  .map { case (x, (y, z)) => (s"$x,$y", z) }
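If you want the (String, String) shape the question asks for, the final map can keep the key parts separate instead of joining them back. The combiner logic itself can be checked on plain Scala collections without a Spark cluster; a minimal sketch using the sample data from the question (groupBy here stands in for the shuffle that reduceByKey performs):

```scala
// Plain-Scala check of the reduce-by-key logic; no Spark needed.
val data = List(("A", "x", 3), ("A", "y", 4), ("A", "z", 1), ("B", "y", 2),
                ("C", "w", 5), ("C", "y", 2), ("E", "x", 1), ("E", "z", 3))

val best = data
  .map { case (k, sub, v) => (k, (sub, v)) }     // reshape to (key, (subKey, value))
  .groupBy(_._1)                                 // collect pairs per first key part
  .map { case (k, pairs) =>
    // Same combiner as the reduceByKey above: keep the element with the larger Int
    (k, pairs.map(_._2).reduce((a, b) => if (a._2 > b._2) a else b)._1)
  }
// best("A") == "y", best("B") == "y", best("C") == "w", best("E") == "z"
```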
You can use groupByKey and then the maxBy function to get the output:
val data = Array(("A,x", 3), ("A,y", 4), ("A,z", 1), ("B,y", 2),
                 ("C,w", 5), ("C,y", 2), ("E,x", 1), ("E,z", 3))

val rdd = sc.makeRDD(data).map(i => { // Parallelize the sample data
  val t = i._1.split(",")             // Split the String key on ","
  t(0) -> (t(1), i._2)                // (String, Int) -> (String, (String, Int))
}).groupByKey().map(i => {            // Group by the first key part
  (i._1, i._2.maxBy(_._2)._1)         // Keep the sub-key with the max Int via maxBy
})

rdd.foreach(println(_))               // Print the output
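A note on the design choice: groupByKey shuffles every value for a key to a single executor, while reduceByKey (as in the first answer) combines values map-side first, so it usually scales better for this kind of per-key maximum. The maxBy step itself is plain Scala and easy to verify on its own:

```scala
// maxBy returns the element whose projected value (here the Int) is largest;
// ._1 then extracts the sub-key, matching the groupByKey answer above.
val grouped = Iterable(("x", 3), ("y", 4), ("z", 1))
val winner = grouped.maxBy(_._2)._1
// winner == "y"
```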