Interpreting timestamp fields in Spark when reading JSON



I am trying to read a pretty-printed JSON file that contains time fields. I want the timestamp columns to be interpreted as timestamp fields while reading the JSON itself. However, when I call printSchema, they are still read as strings.

For example, the input JSON file:

[{
"time_field" : "2017-09-30 04:53:39.412496Z"
}]

Code:

df = spark.read.option("multiLine", "true").option("timestampFormat","yyyy-MM-dd HH:mm:ss.SSSSSS'Z'").json('path_to_json_file')

Output of df.printSchema():

root
|-- time_field: string (nullable = true)

What am I missing here?

My own experience with the timestampFormat option is that it does not quite work as advertised. I would simply read the time fields as strings and convert them with to_timestamp, as shown below (using a slightly generalized sample input):

# /path/to/jsonfile
[{
"id": 101, "time_field": "2017-09-30 04:53:39.412496Z"
},
{
"id": 102, "time_field": "2017-10-01 01:23:45.123456Z"
}]

In Python:

from pyspark.sql.functions import to_timestamp
df = spark.read.option("multiLine", "true").json("/path/to/jsonfile")
df = df.withColumn("timestamp", to_timestamp("time_field"))
df.show(2, False)
+---+---------------------------+-------------------+
|id |time_field                 |timestamp          |
+---+---------------------------+-------------------+
|101|2017-09-30 04:53:39.412496Z|2017-09-30 04:53:39|
|102|2017-10-01 01:23:45.123456Z|2017-10-01 01:23:45|
+---+---------------------------+-------------------+
df.printSchema()
root
|-- id: long (nullable = true)
|-- time_field: string (nullable = true)
|-- timestamp: timestamp (nullable = true)
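
Once the column is a real timestamp, the usual date/time functions can be applied to it directly; a small usage sketch against the DataFrame above (the aliases yr and hr are just illustrative names):

from pyspark.sql.functions import year, hour
# Extract parts of the converted timestamp column
df.select("id", year("timestamp").alias("yr"), hour("timestamp").alias("hr")).show(2, False)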

In Scala:

import org.apache.spark.sql.functions.to_timestamp
import spark.implicits._  // for the $"..." column syntax (already in scope in spark-shell)

val df = spark.read.option("multiLine", "true").json("/path/to/jsonfile")
df.withColumn("timestamp", to_timestamp($"time_field"))

This is a bug in Spark 2.4.0, tracked as SPARK-26325.

For Spark version 2.4.4:

import org.apache.spark.sql.types.TimestampType
import spark.implicits._  // for toDF and the $"..." column syntax

// Strings to timestamps
val df = Seq(
  "2019-07-01 12:01:19.000",
  "2019-06-24 12:01:19.000",
  "2019-11-16 16:44:55.406",
  "2019-11-16 16:50:59.406").toDF("input_timestamp")
val df_mod = df.select($"input_timestamp".cast(TimestampType))
df_mod.printSchema

Output:

root
|-- input_timestamp: timestamp (nullable = true)
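
The same cast-based approach can be written in PySpark against the JSON file from the question; a minimal sketch (the path and column name are taken from the question, and whether the value with the trailing 'Z' parses correctly still depends on the Spark version):

from pyspark.sql.types import TimestampType

df = spark.read.option("multiLine", "true").json("path_to_json_file")
# The cast changes the column type; values that cannot be parsed become null
df_mod = df.withColumn("time_field", df["time_field"].cast(TimestampType()))
df_mod.printSchema()
# root
#  |-- time_field: timestamp (nullable = true)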
