Spark - selecting rows with more than one element after withColumn("newCol", collect_list(...))



I am creating a DataFrame from this JSON:

{"id" : "1201", "name" : "satish", "age" : "25"},
{"id" : "1202", "name" : "krishna", "age" : "28"},
{"id" : "1203", "name" : "amith", "age" : "39"},
{"id" : "1204", "name" : "javed", "age" : "23"},
{"id" : "1205", "name" : "mendy", "age" : "25"},
{"id" : "1206", "name" : "rob", "age" : "24"},
{"id" : "1207", "name" : "prudvi", "age" : "23"}

The DataFrame initially looks like this:

+---+----+-------+
|age|  id|   name|
+---+----+-------+
| 25|1201| satish|
| 28|1202|krishna|
| 39|1203|  amith|
| 23|1204|  javed|
| 25|1205|  mendy|
| 24|1206|    rob|
| 23|1207| prudvi|
+---+----+-------+

What I need is to group all students with the same age and order them by their id. This is how I have approached it so far:

*Note: I'm fairly sure that using select("newCol") would be more efficient than adding a new column with withColumn("newCol", ...), but I don't know how to write it that way (see the select-based sketch after the output below).

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("SimpleApp").set("spark.driver.allowMultipleContexts", "true").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val df = sqlContext.read.json("students.json")

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._

// Collect the (age, id, name) structs for each age, ordered by id within the partition
val mergedDF = df.withColumn("newCol", collect_list(struct("age", "id", "name")).over(Window.partitionBy("age").orderBy("id"))).select("newCol")

The output I get is:

[WrappedArray([25,1201,satish], [25,1205,mendy])]
[WrappedArray([24,1206,rob])]
[WrappedArray([23,1204,javed])]
[WrappedArray([23,1204,javed], [23,1207,prudvi])]
[WrappedArray([28,1202,krishna])]
[WrappedArray([39,1203,amith])]
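Regarding the note above about select versus withColumn: a minimal sketch of the same window expression written with select (assuming the same df, imports, and window spec as in the code above) would be:

// Sketch only: the same window aggregation via select, so only the aggregated column is produced
val selectedDF = df.select(
  collect_list(struct("age", "id", "name"))
    .over(Window.partitionBy("age").orderBy("id"))
    .as("newCol")
)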

Now, how do I keep only the rows that have more than one element? That is, I want my final DataFrame to be:

[WrappedArray([25,1201,satish], [25,1205,mendy])]
[WrappedArray([23,1204,javed], [23,1207,prudvi])]

My best approach so far is:

val mergedDF = df.withColumn("newCol", collect_list(struct("age","id","name")).over(Window.partitionBy("age").orderBy("id")))
val filterd = mergedDF.withColumn("count", count("age").over(Window.partitionBy("age"))).filter($"count" > 1).select("newCol")

But I must be missing something, because the result is not what I expected:

[WrappedArray([23,1204,javed], [23,1207,prudvi])]
[WrappedArray([25,1201,satish])]
[WrappedArray([25,1201,satish], [25,1205,mendy])]

You can filter the data with size(). Because the window is ordered by id, collect_list is evaluated over a running frame, so each age group produces one row per student with a partially built array; the count-based filter keeps all of those rows, while filtering on the array size keeps only the rows whose collected array actually contains more than one element:

import org.apache.spark.sql.functions.{col,size}
mergedDF.filter(size(col("newCol"))>1).show(false)
+---+----+------+-----------------------------------+
|age|id  |name  |newCol                             |
+---+----+------+-----------------------------------+
|23 |1207|prudvi|[[23,1204,javed], [23,1207,prudvi]]|
|25 |1205|mendy |[[25,1201,satish], [25,1205,mendy]]|
+---+----+------+-----------------------------------+
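A possible alternative (a sketch only, not part of the original answer, assuming the same df as in the question and a Spark version where sort_array can order an array of structs): aggregate with groupBy, so each age produces exactly one row and no intermediate partial arrays have to be filtered out afterwards. sort_array compares the struct fields in order, so the elements end up sorted by age and then id:

import org.apache.spark.sql.functions.{col, collect_list, size, sort_array, struct}

// One row per age, with the structs collected and sorted, keeping only groups of more than one student
val groupedDF = df
  .groupBy("age")
  .agg(sort_array(collect_list(struct("age", "id", "name"))).as("newCol"))
  .filter(size(col("newCol")) > 1)

groupedDF.show(false)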
