我有一个Spark DataFrame(DF(,它看起来像这样:
----------------------------------------------------------
id | b | c | d | e |
----------------------------------------------------------
1 | "ok" | 9 | "dontcare" | "dontcare"|
1 | "not ok" | 10 | "dontcare" | "dontcare"|
1 | "sure" | 1 | "dontcare" | "dontcare"|
2 | "not sure" | 2 | "dontcare" | "dontcare"|
2 | "not so sure" | 12 | "dontcare" | "dontcare"|
1 | "sure bleh" | 1 | "dontcare" | "dontcare"|
3 | "not sure" | 5 | "dontcare" | "dontcare"|
3 | "not so sure" | 25 | "dontcare" | "dontcare"|
----------------------------------------------------------
我正试图通过以下方式在Spark Scala中转换这个DF来创建一个新的DF:
----------------------------------------------------------------------------
id | grouping | count |
----------------------------------------------------------------------------
1 | (("ok",9),("not ok", 10), ("sure", 1), ("sure bleh", 1)) | 4 |
2 | (("not sure",2),("not so sure" , 12)) | 2 |
3 | (("not sure",5),("not so sure" , 25)) | 2 |
----------------------------------------------------------------------------
使用Spark Scala创建这个DF的最佳方法是什么?我在试图弄清楚这种分组逻辑时陷入了困境。到目前为止,我已经尝试过了:
val df = spark.read.option("header","true").option("delimiter","t").csv("test.csv")
df.show
val finalDF = df.groupBy("id").agg(collect_list(array("b", "c")).as("grouping")))
首先将列作为数组类型,然后使用groupBy
。
val df = spark.read.option("header","true").option("delimiter","t").csv("test.csv")
df.show
val finalDF = df.groupBy("id").agg(collect_list(array("b", "c")).as("grouping"), count("*").as("count")).orderBy("id")
finalDF.show(false)
finalDF.printSchema
+---+--------------------------------------------------+-----+
|id |grouping |count|
+---+--------------------------------------------------+-----+
|1 |[[ok, 9], [not ok, 10], [sure, 1], [sure bleh, 1]]|4 |
|2 |[[not sure, 2], [not so sure, 12]] |2 |
|3 |[[not sure, 5], [not so sure, 25]] |2 |
+---+--------------------------------------------------+-----+
root
|-- id: string (nullable = true)
|-- grouping: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- count: long (nullable = false)