Conditional countDistinct in Spark (Scala)



I have the following DataFrame.

+-------+---+----+
|Company|EMP|Flag|
+-------+---+----+
|      M| c1|   Y|
|      M| c1|   Y|
|      M| c2|   N|
|      M| c2|   N|
|      M| c3|   Y|
|      M| c3|   Y|
|      M| c4|   N|
|      M| c4|   N|
|      M| c5|   Y|
|      M| c5|   Y|
|      M| c6|   Y|
+-------+---+----+

It was created with:

// toDF on a Seq requires spark.implicits._ in scope.
val df1 = Seq(
  ("M", "c1", "Y"),
  ("M", "c1", "Y"),
  ("M", "c2", "N"),
  ("M", "c2", "N"),
  ("M", "c3", "Y"),
  ("M", "c3", "Y"),
  ("M", "c4", "N"),
  ("M", "c4", "N"),
  ("M", "c5", "Y"),
  ("M", "c5", "Y"),
  ("M", "c6", "Y")
).toDF("Company", "EMP", "Flag")

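How can I get the distinct count of EMP where Flag = Y and where Flag = N? Once an EMP has a flag, it never changes. I can do this, but is there a way to achieve it without distinct (to avoid an extra join in the code)?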

Expected output:

+---+---+---+----------+----------+
|  M|  Y|  N|Total_ROWs|Unique_Emp|
+---+---+---+----------+----------+
|  M|  4|  2|        11|         6|
+---+---+---+----------+----------+

How about this?

import org.apache.spark.sql.functions.{count, sum, when}

df1
  // Collapse to one row per (Company, EMP, Flag), keeping the number of source rows.
  .groupBy("Company", "EMP", "Flag")
  .agg(count("Company").as("Row"))
  // An EMP's flag never changes, so each EMP now appears once and plain counts are distinct counts.
  .groupBy("Company")
  .agg(
    count(when($"Flag" === "Y", 1)).as("Y"),
    count(when($"Flag" === "N", 1)).as("N"),
    sum("Row").as("Total_ROWs"),
    count("EMP").as("Unique_EMP"))
  .show
+-------+---+---+----------+----------+
|Company|  Y|  N|Total_ROWs|Unique_EMP|
+-------+---+---+----------+----------+
|      M|  4|  2|        11|         6|
+-------+---+---+----------+----------+
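
As a cross-check, the same numbers can probably be reproduced in a single aggregation by applying `countDistinct` to a conditional column. This is only a sketch: it does use a distinct aggregate, which the question hoped to avoid, so treat it as a way to verify the result rather than a replacement. The local `SparkSession` setup and the name `summary` are assumptions for illustration; in spark-shell the session already exists.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, countDistinct, lit, when}

// Assumption: a local session for illustration; getOrCreate reuses an existing one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("conditional-distinct-count")
  .getOrCreate()
import spark.implicits._

val df1 = Seq(
  ("M", "c1", "Y"), ("M", "c1", "Y"),
  ("M", "c2", "N"), ("M", "c2", "N"),
  ("M", "c3", "Y"), ("M", "c3", "Y"),
  ("M", "c4", "N"), ("M", "c4", "N"),
  ("M", "c5", "Y"), ("M", "c5", "Y"),
  ("M", "c6", "Y")
).toDF("Company", "EMP", "Flag")

// when(...) without otherwise yields null for non-matching rows,
// and countDistinct ignores nulls, so only matching EMPs are counted.
val summary = df1
  .groupBy("Company")
  .agg(
    countDistinct(when($"Flag" === "Y", $"EMP")).as("Y"),
    countDistinct(when($"Flag" === "N", $"EMP")).as("N"),
    count(lit(1)).as("Total_ROWs"),
    countDistinct($"EMP").as("Unique_EMP"))

summary.show()
```

If both approaches agree on your real data, the two-step groupBy is doing what you expect. If they differ, some EMP most likely carries more than one flag value.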
