下表的"员工"和"工资"没有特定的顺序。我正在尝试获得一个输出,其中前 30% 获得"高"值,接下来的 40% 获得"平均"值,其余值获得"低">
Employee Salary
Tony 50000
Alan 45000
Lee 60000
David 35000
Steve 65000
Paul 48000
Micky 62000
George 80000
Nigel 64000
John 42000
输出:
Employee Salary Percentage
Tony 50000 Average
Alan 45000 Low
Lee 60000 Average
David 35000 Low
Steve 65000 High
Paul 48000 Average
Micky 62000 Average
George 80000 High
Nigel 64000 High
John 42000 Low
任何帮助将不胜感激!
这可以
通过使用Window
函数percent_rank
来完成。但是,它需要在"工资"列之后对数据帧进行排序。percent_rank
函数将根据排序顺序为每一行提供一个百分比值,更具体地说,给定的值为:
(分区中的行排名 - 1(/(分区中的行数 - 1(
假设原始数据帧df
:
val df2 = df.withColumn("Percentage", percent_rank over Window.orderBy("Salary"))
.withColumn("Percentage", when($"Percentage" > 0.7, "High").
when($"Percentage" < 0.3, "Low").
otherwise("Average"))
使用问题中的数据的结果将是:
+--------+------+----------+
|Employee|Salary|Percentage|
+--------+------+----------+
| David| 35000| Low|
| John| 42000| Low|
| Alan| 45000| Low|
| Paul| 48000| Average|
| Tony| 50000| Average|
| Lee| 60000| Average|
| Micky| 62000| Average|
| Nigel| 64000| High|
| Steve| 65000| High|
| George| 80000| High|
+--------+------+----------+
您可以按如下方式实现它:
import org.apache.spark.sql.functions.percent_rank
import org.apache.spark.sql.expressions.Window
dataDF.show
+--------+------+
|Employee|Salary|
+--------+------+
| Tony| 50000|
| Alan| 45000|
| Lee| 60000|
| David| 35000|
| Steve| 65000|
| Paul| 48000|
| Micky| 62000|
| George| 80000|
| Nigel| 64000|
| John| 42000|
+--------+------+
val window = Window.partitionBy().orderBy(dataDF("Salary"))
dataDF.withColumn("rank",
percent_rank().over(window).alias("rank")).withColumn("Percentage",
when($"rank" > 0.7, "High").when($"rank" <= 0.7 && $"rank" > 0.3,
"Average").otherwise("Low")).drop("rank").show
+--------+------+----------+
|Employee|Salary|Percentage|
+--------+------+----------+
| David| 35000| Low|
| John| 42000| Low|
| Alan| 45000| Low|
| Paul| 48000| Average|
| Tony| 50000| Average|
| Lee| 60000| Average|
| Micky| 62000| Average|
| Nigel| 64000| High|
| Steve| 65000| High|
| George| 80000| High|
+--------+------+----------+