Adding a new column to a DataFrame based on an existing column

I have a CSV file with a datetime column whose values look like "2011-05-02T04:52:09+00:00".

I am using Scala. The file is loaded into a Spark DataFrame, and I can use Joda-Time to parse the date:

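import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
// Load the CSV through the spark-csv data source, treating the first line as the header
val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "data.csv", "header" -> "true"))
// MM = month and HH = hour of day (0-23); lowercase mm means minutes
val d = org.joda.time.format.DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ssZZ")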

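I would like to create new columns based on the datetime field for time-series analysis.

In a DataFrame, how do I create a column based on the value of another column?

I noticed that DataFrame has the following function: df.withColumn("dt", column). Is there a way to create a column based on the value of an existing column?

Thanks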

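import java.sql.Date
import org.apache.spark.sql.types.DateType
import org.apache.spark.sql.functions._
import org.joda.time.DateTime
import org.joda.time.format.DateTimeFormat
// MM = month, HH = hour of day (0-23), ZZ = offset with a colon such as +00:00
val d = DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ssZZ")
// Spark's DateType is backed by java.sql.Date, so convert the parsed instant to that type
val dtFunc: (String => Date) = (arg1: String) => new Date(DateTime.parse(arg1, d).getMillis)
// Wrap the function as a UDF with an explicit return type and apply it to the source column
val x = df.withColumn("dt", callUDF(dtFunc, DateType, col("dt_string")))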

Both callUDF and col come from org.apache.spark.sql.functions, as the import shows.

The dt_string in col("dt_string") is the name of the source column in df that you want to convert.
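
Because the conversion is an ordinary String => Date function, it can be sanity-checked on the sample value before it is handed to Spark. The following is only a minimal sketch under assumptions: it requires Java 8 or later and swaps in java.time.OffsetDateTime for Joda-Time so that it needs no extra dependency, and the names toSqlDate, sample and parsed are hypothetical.

```scala
import java.sql.Date
import java.time.OffsetDateTime

// Hypothetical stand-in for dtFunc, using java.time instead of Joda-Time (an assumption).
// OffsetDateTime.parse accepts ISO-8601 strings with an offset such as "+00:00".
val toSqlDate: String => Date =
  (s: String) => new Date(OffsetDateTime.parse(s).toInstant.toEpochMilli)

val sample = "2011-05-02T04:52:09+00:00"
val parsed = OffsetDateTime.parse(sample)

// If month and minute were mixed up in a pattern, these two fields would disagree with the input.
println(parsed.getMonthValue)      // month field of the sample value
println(parsed.getMinute)          // minute field of the sample value
println(toSqlDate(sample).getTime) // epoch milliseconds of the parsed instant
```

Checking the month and minute fields separately is a quick way to catch a pattern that uses lowercase mm where the month was intended.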

Alternatively, you could replace the withColumn statement that uses callUDF with:

val dtFunc2 = udf(dtFunc)
val x = df.withColumn("dt", dtFunc2(col("dt_string")))
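
The same udf plus withColumn pattern can produce further columns for time analysis, such as the hour or the day of the week. This is a rough sketch rather than a definitive implementation. It assumes a local Spark installation and Java 8 or later, builds a one-row DataFrame in place of the CSV file, and uses java.time instead of Joda-Time. The column names hour and day_of_week and the helper names hourOf and dayOfWeekOf are made up for illustration.

```scala
import java.time.OffsetDateTime
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{col, udf}

// Local context for illustration only; in spark-shell, sc already exists.
val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("derived-columns"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// One-row stand-in for the CSV data, with the source column named dt_string.
val df = Seq(Tuple1("2011-05-02T04:52:09+00:00")).toDF("dt_string")

// Each UDF parses the string and returns one field of the timestamp (hypothetical helpers).
val hourOf = udf((s: String) => OffsetDateTime.parse(s).getHour)
// getDayOfWeek.getValue follows ISO-8601 numbering: 1 = Monday ... 7 = Sunday.
val dayOfWeekOf = udf((s: String) => OffsetDateTime.parse(s).getDayOfWeek.getValue)

// Chain withColumn calls to add one derived column per UDF.
val withTimeCols = df
  .withColumn("hour", hourOf(col("dt_string")))
  .withColumn("day_of_week", dayOfWeekOf(col("dt_string")))

withTimeCols.show()
```

Each call to withColumn returns a new DataFrame, so the derived columns can be chained in one expression and then used in groupBy or filter for the actual analysis.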
