Achieving the desired output with Spark Scala DataFrame and Dataset



I have a DataFrame as shown below:

scala> df.show
+----+------+
|SLNO|Values|
+----+------+
|   A|     y|
|   A|     t|
|   A|     e|
|   B|     f|
|   C|     g|
|   B|     h|
|   C|     k|
|   C|     u|
|   B|     p|
+----+------+

The expected output is:

SLNO Values
A    y,t,e
B    f,h,p
C    g,k,u

How can I achieve this in Spark Scala, using both the DataFrame and Dataset APIs?

I tried something like the following on a Dataset, but I am stuck after this step:

scala> ds.filter(line => line.split("\t")(0).size <= 1).map(line => Map(line.split("\t")(0) -> line.split("\t")(1)))
res86: org.apache.spark.sql.Dataset[scala.collection.immutable.Map[String,String]] = [value: map<string,string>]

// not sure how to groupByKey from here
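
One way to continue the typed Dataset attempt above is to map each line to a (key, value) pair and then use groupByKey with mapGroups. This is only a sketch: it assumes ds is a Dataset[String] of tab-separated "SLNO<TAB>Value" lines and a spark-shell style session where spark.implicits._ is already in scope.

// Sketch only: ds is assumed to be a Dataset[String] of tab-separated lines.
val pairs = ds
  .map { line =>
    val cols = line.split("\t")
    (cols(0), cols(1))                  // (SLNO, Value)
  }
  .filter(_._1.length <= 1)             // same single-character-key filter as in the attempt above

pairs
  .groupByKey(_._1)                     // KeyValueGroupedDataset[String, (String, String)]
  .mapGroups((slno, rows) => (slno, rows.map(_._2).mkString(",")))
  .toDF("SLNO", "Values")
  .orderBy("SLNO")
  .show()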

Alternatively, register the DataFrame as a temporary view and use Spark SQL:

df.createOrReplaceTempView("df")
spark.sql("select SLNO, array_join(collect_list(Values), ',') as Values from df group by SLNO")

Check the code below.

scala> df.show(false)
+----+------+
|slno|values|
+----+------+
|A   |y     |
|A   |t     |
|A   |e     |
|B   |f     |
|C   |g     |
|B   |h     |
|C   |k     |
|C   |u     |
|B   |p     |
+----+------+

scala> df
.groupBy("slno")
.agg(concat_ws(",",collect_list($"values")).as("values"))
.orderBy($"slno".asc)
.show(false)
+----+------+
|slno|values|
+----+------+
|A   |y,t,e |
|B   |f,h,p |
|C   |g,k,u |
+----+------+
scala> case class Example(slno: String,values:String)
defined class Example
scala> val ds = Seq(Example("A","y"),Example("A","t"),Example("A","e"),Example("B","f"),Example("C","g"),Example("B","h"),Example("C","k"),Example("C","u"),Example("B","p")).toDS
scala> ds
.groupBy("slno")
.agg(concat_ws(",",collect_list($"values")).as("values"))
.orderBy($"slno".asc)
.show(false)
+----+------+
|slno|values|
+----+------+
|A   |y,t,e |
|B   |f,h,p |
|C   |g,k,u |
+----+------+
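
Note that collect_list does not guarantee the order in which values are gathered after the shuffle, so y,t,e happens to match the input order here but is not deterministic in general. If a deterministic (alphabetical, rather than input) order per group is acceptable, one variant is to sort the collected array before joining it:

scala> df
.groupBy("slno")
.agg(concat_ws(",", sort_array(collect_list($"values"))).as("values"))
.orderBy($"slno".asc)
.show(false)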
