考虑以下数据:
Name | Flag
A | 0
A | 1
A | 0
B | 0
B | 1
B | 1
我想把它转换成:
Name | Total | With Flag | Percentage
A | 3 | 1 | 33%
B | 3 | 2 | 66%
最好是在Spark SQL中。
例如:
val df = sc.parallelize(Seq(
("A", 0), ("A", 1), ("A", 0),
("B", 0), ("B", 1), ("B", 1)
)).toDF("Name", "Flag")
df.groupBy($"Name").agg(
count("*").alias("total"),
sum($"flag").alias("with_flag"),
// Do you really want to truncate not for example round?
mean($"flag").multiply(100).cast("integer").alias("percentage"))
// +----+-----+---------+----------+
// |name|total|with_flag|percentage|
// +----+-----+---------+----------+
// | A| 3| 1| 33|
// | B| 3| 2| 66|
// +----+-----+---------+----------+
或:
df.registerTempTable("df")
sqlContext.sql("""
SELECT name, COUNT(*) total, SUM(flag) with_flag,
CAST(AVG(flag) * 100 AS INT) percentage
FROM df
GROUP BY name""")
// +----+-----+---------+----------+
// |name|total|with_flag|percentage|
// +----+-----+---------+----------+
// | A| 3| 1| 33|
// | B| 3| 2| 66|
// +----+-----+---------+----------+