我正在尝试将星期几字符串名称(例如TUESDAY)转换为整数(例如3)。我在下面写了一张地图,但我不确定如何应用它。
import org.apache.spark.sql.functions.lower
val dayNameToInteger = Map(
"sunday" -> 1,
"monday" -> 2,
"tuesday" -> 3,
"wednesday" -> 4,
"thursday" -> 5,
"friday" -> 6,
"saturday" -> 7)
var input = sqlContext.createDataFrame(Seq(
(0L, "SUNDAY", 34),
(1L, "Monday", 31),
(2L, "tuesday", 25)
)).toDF("id", "day_of_week", "value")
scala> input.show
+---+-----------+-----+
| id|day_of_week|value|
+---+-----------+-----+
| 0| SUNDAY| 34|
| 1| Monday| 31|
| 2| tuesday| 25|
+---+-----------+-----+
var output = input.select($"id", dayNameToInteger(lower(input("day_of_week"))))
<console>:27: error: type mismatch;
found : org.apache.spark.sql.Column
required: String
var output = input.select($"id", dayNameToInteger(lower(input("day_of_week"))))
转换应该通过UDF完成,因为UDF可以对所有行中的一列进行操作。
val dayToInt = udf((dayOfWeek:String) => {
dayOfWeek match {
case "sunday" => 1
case "monday" => 2
case "tuesday" => 3
case "wednesday" => 4
case "thursday" => 5
case "friday" => 6
case "saturday" => 7
}
})
var output = input.select($"id", dayToInt(lower(input("day_of_week"))).as("day_int"))
scala> output.show
+---+-------+
| id|day_int|
+---+-------+
| 0| 1|
| 1| 2|
| 2| 3|
+---+-------+
作为对上述答案的改进,以下是我与Spark 2.x兼容的udf,用于将星期几字符串索引为整数:
spark.sqlContext.udf.register("dayToInt", (dayOfWeek:String) => {
dayOfWeek.toLowerCase() match {
case "sunday" => 1
case "monday" => 2
case "tuesday" => 3
case "wednesday" => 4
case "thursday" => 5
case "friday" => 6
case "saturday" => 7
}
})
也用于机器学习模型:
sql("""SELECT CAST(dayToInt(date_format(dateEst, 'EEEE')) AS float) AS dayOfWeekInt
FROM data""")