I have the following DataFrame.
+-------+---+----+
|Company|EMP|Flag|
+-------+---+----+
|      M| c1|   Y|
|      M| c1|   Y|
|      M| c2|   N|
|      M| c2|   N|
|      M| c3|   Y|
|      M| c3|   Y|
|      M| c4|   N|
|      M| c4|   N|
|      M| c5|   Y|
|      M| c5|   Y|
|      M| c6|   Y|
+-------+---+----+
It is created with:
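import spark.implicits._  // needed for toDF; assumes a SparkSession named spark (as in spark-shell)

val df1 = Seq(
  ("M", "c1", "Y"),
  ("M", "c1", "Y"),
  ("M", "c2", "N"),
  ("M", "c2", "N"),
  ("M", "c3", "Y"),
  ("M", "c3", "Y"),
  ("M", "c4", "N"),
  ("M", "c4", "N"),
  ("M", "c5", "Y"),
  ("M", "c5", "Y"),
  ("M", "c6", "Y")
).toDF("Company", "EMP", "Flag")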
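How can I get the distinct count of EMP where Flag = Y and where Flag = N? Once an EMP has a flag, it never changes. I can already do this with distinct, but is there a way to achieve it without distinct (to avoid an extra join in the code)?
Expected output: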
+-------+---+---+----------+----------+
|Company|  Y|  N|Total_ROWs|Unique_Emp|
+-------+---+---+----------+----------+
|      M|  4|  2|        11|         6|
+-------+---+---+----------+----------+
How about this?
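import org.apache.spark.sql.functions.{count, sum, when}
import spark.implicits._  // for the $"..." column syntax; assumes a SparkSession named spark

// Step 1: one row per (Company, EMP, Flag); Row keeps the number of source rows.
df1.groupBy("Company", "EMP", "Flag")
  .agg(count("Company").as("Row"))
  // Step 2: each remaining row is one distinct employee, so plain counts suffice.
  .groupBy("Company")
  .agg(
    count(when($"Flag" === "Y", 1)).as("Y"),
    count(when($"Flag" === "N", 1)).as("N"),
    sum("Row").as("Total_ROWs"),
    count("EMP").as("Unique_EMP"))
  .show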
+-------+---+---+----------+----------+
|Company|  Y|  N|Total_ROWs|Unique_EMP|
+-------+---+---+----------+----------+
|      M|  4|  2|        11|         6|
+-------+---+---+----------+----------+
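To see why grouping first removes the need for distinct, here is a rough sketch of the same two-step idea on plain Scala collections, with no Spark involved. The names TwoStepCount, Summary and summarize are made up for illustration and are not part of any library. It relies on the statement in the question that an EMP's flag never changes, so each (Company, EMP, Flag) group corresponds to exactly one employee.

```scala
// Plain-Scala mirror of the two-step aggregation (illustrative only, no Spark required).
object TwoStepCount {
  // Hypothetical result type: one summary row per company.
  final case class Summary(company: String, y: Int, n: Int, totalRows: Int, uniqueEmp: Int)

  def summarize(rows: Seq[(String, String, String)]): Seq[Summary] = {
    // Step 1: one entry per (Company, EMP, Flag), remembering how many rows it covered.
    val perEmp: Map[(String, String, String), Int] =
      rows.groupBy(identity).map { case (key, group) => key -> group.size }

    // Step 2: roll up per company; every entry now stands for one distinct employee.
    perEmp.toSeq
      .groupBy { case ((company, _, _), _) => company }
      .map { case (company, entries) =>
        Summary(
          company,
          y = entries.count { case ((_, _, flag), _) => flag == "Y" },
          n = entries.count { case ((_, _, flag), _) => flag == "N" },
          totalRows = entries.map(_._2).sum,
          uniqueEmp = entries.size)
      }
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    // Same sample data as in the question.
    val rows = Seq(
      ("M", "c1", "Y"), ("M", "c1", "Y"),
      ("M", "c2", "N"), ("M", "c2", "N"),
      ("M", "c3", "Y"), ("M", "c3", "Y"),
      ("M", "c4", "N"), ("M", "c4", "N"),
      ("M", "c5", "Y"), ("M", "c5", "Y"),
      ("M", "c6", "Y"))
    summarize(rows).foreach(println)  // prints Summary(M,4,2,11,6)
  }
}
```

If an employee could appear with both flags, that employee would form two groups in step 1 and would be counted twice in uniqueEmp, so this shortcut only holds under the assumption stated in the question.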