I have the following dataframe in Spark 2.2 and Scala 2.11.8:
+--------+---------+-------+-------+----+-------+
|event_id|person_id|channel| group|num1| num2|
+--------+---------+-------+-------+----+-------+
| 560| 9410| web| G1| 0| 5|
| 290| 1430| web| G1| 0| 3|
| 470| 1370| web| G2| 0| 18|
| 290| 1430| web| G2| 0| 5|
| 290| 1430| mob| G2| 1| 2|
+--------+---------+-------+-------+----+-------+
Here is the equivalent code that creates this dataframe (note that this snippet is PySpark, not Scala):
df = sqlCtx.createDataFrame(
    [(560, 9410, "web", "G1", 0, 5),
     (290, 1430, "web", "G1", 0, 3),
     (470, 1370, "web", "G2", 0, 18),
     (290, 1430, "web", "G2", 0, 5),
     (290, 1430, "mob", "G2", 1, 2)],
    ["event_id", "person_id", "channel", "group", "num1", "num2"]
)
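Since the question targets Scala 2.11.8, here is a minimal Scala sketch that builds the same dataframe, assuming a SparkSession named spark is in scope (it is predefined in spark-shell):

import spark.implicits._

val df = Seq(
  (560, 9410, "web", "G1", 0, 5),
  (290, 1430, "web", "G1", 0, 3),
  (470, 1370, "web", "G2", 0, 18),
  (290, 1430, "web", "G2", 0, 5),
  (290, 1430, "mob", "G2", 1, 2)
).toDF("event_id", "person_id", "channel", "group", "num1", "num2")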
The group column can take only two values: G1 and G2. I need to convert these values of the group column into new columns, as follows:
+--------+---------+-------+--------+-------+--------+-------+
|event_id|person_id|channel| num1_G1|num2_G1| num1_G2|num2_G2|
+--------+---------+-------+--------+-------+--------+-------+
| 560| 9410| web| 0| 5| 0| 0|
| 290| 1430| web| 0| 3| 0| 0|
| 470| 1370| web| 0| 0| 0| 18|
| 290| 1430| web| 0| 0| 0| 5|
| 290| 1430| mob| 0| 0| 1| 2|
+--------+---------+-------+--------+-------+--------+-------+
How can I do this?
AFAIK there is no way to pivot without an aggregation (at least I could not find one). Scala version:

import org.apache.spark.sql.functions.max

df.groupBy("event_id", "person_id", "channel")
  .pivot("group")
  .agg(max("num1") as "num1", max("num2") as "num2")
  .na.fill(0)
  .show
+--------+---------+-------+-------+-------+-------+-------+
|event_id|person_id|channel|G1_num1|G1_num2|G2_num1|G2_num2|
+--------+---------+-------+-------+-------+-------+-------+
| 560| 9410| web| 0| 5| 0| 0|
| 290| 1430| web| 0| 3| 0| 5|
| 470| 1370| web| 0| 0| 0| 18|
| 290| 1430| mob| 0| 0| 1| 2|
+--------+---------+-------+-------+-------+-------+-------+
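Note that the pivoted column names come out as G1_num1, G2_num1, and so on, while the question asks for num1_G1, num2_G1, etc. A minimal rename sketch, assuming the pivoted result above is bound to a val named pivoted (a hypothetical name):

val renamed = pivoted.columns.foldLeft(pivoted) { (acc, c) =>
  // only touch the pivot-generated columns, e.g. "G1_num1" -> "num1_G1"
  if (c.startsWith("G1_") || c.startsWith("G2_")) {
    val Array(g, n) = c.split("_", 2)
    acc.withColumnRenamed(c, s"${n}_${g}")
  } else acc
}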