找出这个Spark Scala数据帧的分组逻辑



我有一个Spark DataFrame(DF(,它看起来像这样:

----------------------------------------------------------
id | b                      | c  | d          | e         |
----------------------------------------------------------
1  | "ok"                   | 9  | "dontcare" | "dontcare"|
1  | "not ok"               | 10 | "dontcare" | "dontcare"|
1  | "sure"                 | 1  | "dontcare" | "dontcare"|
2  | "not sure"             | 2  | "dontcare" | "dontcare"|
2  | "not so sure"          | 12 | "dontcare" | "dontcare"|
1  | "sure bleh"            | 1  | "dontcare" | "dontcare"|
3  | "not sure"             | 5  | "dontcare" | "dontcare"|
3  | "not so sure"          | 25 | "dontcare" | "dontcare"|
----------------------------------------------------------

我正试图通过以下方式在Spark Scala中转换这个DF来创建一个新的DF:

----------------------------------------------------------------------------
id | grouping                                                       | count |
----------------------------------------------------------------------------
1  | (("ok",9),("not ok", 10), ("sure", 1), ("sure bleh", 1))       |   4   |        
2  | (("not sure",2),("not so sure" , 12))                          |   2   |        
3  | (("not sure",5),("not so sure" , 25))                          |   2   |        
----------------------------------------------------------------------------

使用Spark Scala创建这个DF的最佳方法是什么?我在试图弄清楚这种分组逻辑时陷入了困境。到目前为止,我已经尝试过了:

val df = spark.read.option("header","true").option("delimiter","t").csv("test.csv")
df.show
val finalDF = df.groupBy("id").agg(collect_list(array("b", "c")).as("grouping")))

首先将列作为数组类型,然后使用groupBy

val df = spark.read.option("header","true").option("delimiter","t").csv("test.csv")
df.show
val finalDF = df.groupBy("id").agg(collect_list(array("b", "c")).as("grouping"), count("*").as("count")).orderBy("id")
finalDF.show(false)
finalDF.printSchema
+---+--------------------------------------------------+-----+
|id |grouping                                          |count|
+---+--------------------------------------------------+-----+
|1  |[[ok, 9], [not ok, 10], [sure, 1], [sure bleh, 1]]|4    |
|2  |[[not sure, 2], [not so sure, 12]]                |2    |
|3  |[[not sure, 5], [not so sure, 25]]                |2    |
+---+--------------------------------------------------+-----+
root
|-- id: string (nullable = true)
|-- grouping: array (nullable = true)
|    |-- element: array (containsNull = true)
|    |    |-- element: string (containsNull = true)
|-- count: long (nullable = false)

最新更新