I have a DataFrame as below:
scala> df.show
+----+------+
|SLNO|Values|
+----+------+
| A| y|
| A| t|
| A| e|
| B| f|
| C| g|
| B| h|
| C| k|
| C| u|
| B| p|
+----+------+
The expected output is:
SLNO Values
A y,t,e
B f,h,p
C g,k,u
How can I achieve this with Spark Scala's DataFrame and Dataset APIs?
I tried something like the following on a Dataset, but got stuck after this step:
scala> ds.filter(line => line.split("\t")(0).size <= 1).map(line => Map(line.split("\t")(0) -> line.split("\t")(1)))
res86: org.apache.spark.sql.Dataset[scala.collection.immutable.Map[String,String]] = [value: map<string,string>]
// not sure how to proceed with groupByKey from here
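The grouping you want is, at its core, just "group by key, then join each group's values with a comma". As a minimal plain-Scala sketch (no Spark, tuples instead of rows) to illustrate the logic:

```scala
// Plain-Scala sketch of the desired grouping: group rows by the first
// field, then comma-join each group's second fields in input order.
val rows = Seq(("A", "y"), ("A", "t"), ("A", "e"), ("B", "f"),
               ("C", "g"), ("B", "h"), ("C", "k"), ("C", "u"), ("B", "p"))

val grouped: Map[String, String] =
  rows.groupBy(_._1)                                      // key -> Seq of rows
      .map { case (k, vs) => k -> vs.map(_._2).mkString(",") }

// grouped("A") == "y,t,e", grouped("B") == "f,h,p", grouped("C") == "g,k,u"
```

Seq.groupBy preserves the relative order of elements within each group, which is why the joined strings come out in input order.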
Using Spark SQL, collect the values per key with collect_list and join them with array_join (available from Spark 2.4):
df.createOrReplaceTempView("df")
spark.sql("select SLNO, array_join(collect_list(Values), ',') as Values from df group by SLNO")
Check the code below.
scala> df.show(false)
+----+------+
|slno|values|
+----+------+
|A |y |
|A |t |
|A |e |
|B |f |
|C |g |
|B |h |
|C |k |
|C |u |
|B |p |
+----+------+
scala> df
.groupBy("slno")
.agg(concat_ws(",",collect_list($"values")).as("values"))
.orderBy($"slno".asc)
.show(false)
+----+------+
|slno|values|
+----+------+
|A |y,t,e |
|B |f,h,p |
|C |g,k,u |
+----+------+
scala> case class Example(slno: String,values:String)
defined class Example
scala> val ds = Seq(Example("A","y"),Example("A","t"),Example("A","e"),Example("B","f"),Example("C","g"),Example("B","h"),Example("C","k"),Example("C","u"),Example("B","p")).toDS
scala> ds
.groupBy("slno")
.agg(concat_ws(",",collect_list($"values")).as("values"))
.orderBy($"slno".asc)
.show(false)
+----+------+
|slno|values|
+----+------+
|A |y,t,e |
|B |f,h,p |
|C |g,k,u |
+----+------+
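If you specifically want the typed Dataset API (the groupByKey you were reaching for), the same aggregation can be sketched with groupByKey plus mapGroups. The per-group join is plain Scala, so it can live in a small helper (joinValues is a hypothetical name introduced here for illustration):

```scala
case class Example(slno: String, values: String)

// Joins one group's values with commas; this is the function that
// mapGroups would apply to each key's iterator of rows.
def joinValues(slno: String, rows: Iterator[Example]): (String, String) =
  (slno, rows.map(_.values).mkString(","))

// On a Spark Dataset this would be wired up roughly as (requires a
// SparkSession and spark.implicits._ in scope):
//   ds.groupByKey(_.slno)
//     .mapGroups(joinValues)
//     .toDF("slno", "values")
//     .orderBy($"slno".asc)
```

Note that with mapGroups the order of values inside a group is not guaranteed after a shuffle, whereas the untyped collect_list version above has the same caveat; if ordering matters, add an explicit ordering column.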