Spark SQL:将星期天名称转换为整数



我正在尝试将星期几字符串名称(例如TUESDAY)转换为整数(例如3)。我在下面写了一张地图,但我不确定如何应用它。

import org.apache.spark.sql.functions.lower
val dayNameToInteger = Map(
    "sunday"    -> 1,
    "monday"    -> 2,
    "tuesday"   -> 3,
    "wednesday" -> 4,
    "thursday"  -> 5,
    "friday"    -> 6,
    "saturday"  -> 7)

var input = sqlContext.createDataFrame(Seq(
    (0L, "SUNDAY", 34),
    (1L, "Monday", 31),
    (2L, "tuesday", 25)
    )).toDF("id", "day_of_week", "value")

scala> input.show
+---+-----------+-----+
| id|day_of_week|value|
+---+-----------+-----+
|  0|     SUNDAY|   34|
|  1|     Monday|   31|
|  2|    tuesday|   25|
+---+-----------+-----+

var output = input.select($"id", dayNameToInteger(lower(input("day_of_week"))))
<console>:27: error: type mismatch;
 found   : org.apache.spark.sql.Column
 required: String
       var output = input.select($"id", dayNameToInteger(lower(input("day_of_week"))))

转换应该通过UDF完成,因为UDF可以对所有行中的一列进行操作。

val dayToInt = udf((dayOfWeek:String) => {
    dayOfWeek match {
        case "sunday"    => 1
        case "monday"    => 2
        case "tuesday"   => 3 
        case "wednesday" => 4
        case "thursday"  => 5
        case "friday"    => 6
        case "saturday"  => 7
    }
})  

var output = input.select($"id", dayToInt(lower(input("day_of_week"))).as("day_int"))

scala> output.show
+---+-------+
| id|day_int|
+---+-------+
|  0|      1|
|  1|      2|
|  2|      3|
+---+-------+

作为对上述答案的改进,以下是我与Spark 2.x兼容的udf,用于将星期几字符串索引为整数:

spark.sqlContext.udf.register("dayToInt", (dayOfWeek:String) => {
    dayOfWeek.toLowerCase() match {
        case "sunday"    => 1
        case "monday"    => 2
        case "tuesday"   => 3 
        case "wednesday" => 4
        case "thursday"  => 5
        case "friday"    => 6
        case "saturday"  => 7
    }
})

也用于机器学习模型:

sql("""SELECT CAST(dayToInt(date_format(dateEst, 'EEEE')) AS float) AS dayOfWeekInt 
FROM data""")

相关内容

  • 没有找到相关文章

最新更新