如何添加具有基于热门程度的值的列



下表的"员工"和"工资"没有特定的顺序。我正在尝试获得一个输出,其中前 30% 获得"高"值,接下来的 40% 获得"平均"值,其余值获得"低">

Employee  Salary
Tony      50000
Alan      45000
Lee       60000
David     35000
Steve     65000
Paul      48000
Micky     62000
George    80000
Nigel     64000
John      42000

输出:

Employee   Salary   Percentage
Tony       50000    Average
Alan       45000    Low
Lee        60000    Average
David      35000    Low
Steve      65000    High
Paul       48000    Average
Micky      62000    Average
George     80000    High
Nigel      64000    High
John       42000    Low

任何帮助将不胜感激!

这可以

通过使用Window函数percent_rank来完成。但是,它需要在"工资"列之后对数据帧进行排序。percent_rank函数将根据排序顺序为每一行提供一个百分比值,更具体地说,给定的值为:

(分区中的行排名 - 1(/(分区中的行数 - 1(

假设原始数据帧df

val df2 = df.withColumn("Percentage", percent_rank over Window.orderBy("Salary"))
  .withColumn("Percentage", when($"Percentage" > 0.7, "High").
                            when($"Percentage" < 0.3, "Low").
                            otherwise("Average"))

使用问题中的数据的结果将是:

+--------+------+----------+
|Employee|Salary|Percentage|
+--------+------+----------+
|   David| 35000|       Low|
|    John| 42000|       Low|
|    Alan| 45000|       Low|
|    Paul| 48000|   Average|
|    Tony| 50000|   Average|
|     Lee| 60000|   Average|
|   Micky| 62000|   Average|
|   Nigel| 64000|      High|
|   Steve| 65000|      High|
|  George| 80000|      High|
+--------+------+----------+

您可以按如下方式实现它:

import org.apache.spark.sql.functions.percent_rank
import org.apache.spark.sql.expressions.Window
 dataDF.show
+--------+------+
|Employee|Salary|
+--------+------+
|    Tony| 50000|
|    Alan| 45000|
|     Lee| 60000|
|   David| 35000|
|   Steve| 65000|
|    Paul| 48000|
|   Micky| 62000|
|  George| 80000|
|   Nigel| 64000|
|    John| 42000|
+--------+------+
val window = Window.partitionBy().orderBy(dataDF("Salary"))
dataDF.withColumn("rank", 
percent_rank().over(window).alias("rank")).withColumn("Percentage", 
when($"rank" > 0.7, "High").when($"rank" <= 0.7 && $"rank" > 0.3, 
"Average").otherwise("Low")).drop("rank").show
+--------+------+----------+
|Employee|Salary|Percentage|
+--------+------+----------+
|   David| 35000|       Low|
|    John| 42000|       Low|
|    Alan| 45000|       Low|
|    Paul| 48000|   Average|
|    Tony| 50000|   Average|
|     Lee| 60000|   Average|
|   Micky| 62000|   Average|
|   Nigel| 64000|      High|
|   Steve| 65000|      High|
|  George| 80000|      High|
+--------+------+----------+

相关内容

  • 没有找到相关文章

最新更新