Creating a user-defined function (UDF) in Spark Scala



I'm a beginner in Scala and want to understand UDFs in Spark Scala. I'll illustrate my question with the example below. I'm using Spark with Scala on Databricks.

Suppose I have the following DataFrame:

val someDF = Seq(
(1, "bat"),
(4, "mouse"),
(3, "horse")
).toDF("number", "word")
someDF.show()
+------+-----+
|number| word|
+------+-----+
|     1|  bat|
|     4|mouse|
|     3|horse|
+------+-----+

I need to create a function that computes a new column by doing some calculation on the number column.

As an example, I created this function to calculate 25/(number + 1) as shown below, and it worked:

import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.functions._
val caldf = udf { (df: Double) => (25/(df+1)) }
someDF.select($"number", $"word", caldf(col("number")) as "newc").show()
+------+-----+----+
|number| word|newc|
+------+-----+----+
|     1|  bat|12.5|
|     4|mouse| 5.0|
|     3|horse|6.25|
+------+-----+----+

But when I try the same thing with the log operator, it doesn't work:

import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.functions._
val caldf = udf { (df: Double) => log(25/(df+1)) }


command-3140852555505238:3: error: overloaded method value log with alternatives:
(columnName: String)org.apache.spark.sql.Column <and>
(e: org.apache.spark.sql.Column)org.apache.spark.sql.Column
cannot be applied to (Double)
val caldf = udf { (df: Double) => log(25/(df+1)) }
^

Can someone help me figure out what the reason is? Thanks a lot.

The log you are calling inside the UDF is the one imported from org.apache.spark.sql.functions, which accepts a Column (or a column name) rather than a plain Scala Double, hence the overload error. The function in your question doesn't need a UDF at all:

someDF.select($"number", $"word", log(lit(25) / (lit(1) + $"number")) as "newC")

If you insist on using a UDF:

val caldf = udf { df: Double => math.log(25/(df+1)) }
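
Applied the same way as your earlier UDF, a quick usage sketch would look like this (the newc values become the natural logs of the earlier results, roughly 2.53, 1.61 and 1.83):

import org.apache.spark.sql.functions.col

// math.log operates on plain Scala Doubles, so it can be used inside the
// UDF body, unlike the Column-based log from org.apache.spark.sql.functions.
someDF.select($"number", $"word", caldf(col("number")) as "newc").show()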
