Converting a string (with timestamp) to a timestamp in PySpark



I have a DataFrame with a string datetime column. I am converting it to a timestamp, but the value changes. Below is my code; can anyone help me convert it without the value changing?

from pyspark.sql.functions import to_timestamp

df = spark.createDataFrame(
    data=[("1", "2020-04-06 15:06:16 +00:00")],
    schema=["id", "input_timestamp"])
df.printSchema()
# Timestamp string to TimestampType
df = df.withColumn("timestamp", to_timestamp("input_timestamp"))
# Cast the TimestampType column to StringType
df.withColumn("timestamp_string",
              to_timestamp("timestamp").cast("string")) \
  .show(truncate=False)

Output:

+---+--------------------------+-------------------+-------------------+
|id |input_timestamp           |timestamp          |timestamp_string   |
+---+--------------------------+-------------------+-------------------+
|1  |2020-04-06 15:06:16 +00:00|2020-04-06 08:06:16|2020-04-06 08:06:16|
+---+--------------------------+-------------------+-------------------+

I would like to know why the time changed from 15 to 8, and how can I prevent it?

I believe to_timestamp is converting the timestamp value to your local time, because your data contains +00:00.

  • Try passing a format to the to_timestamp() function.

Example:

from pyspark.sql.functions import col, to_timestamp

df.withColumn("timestamp", to_timestamp(col("input_timestamp"), "yyyy-MM-dd HH:mm:ss +00:00")).show(10, False)
#+---+--------------------------+-------------------+
#|id |input_timestamp           |timestamp          |
#+---+--------------------------+-------------------+
#|1  |2020-04-06 15:06:16 +00:00|2020-04-06 15:06:16|
#+---+--------------------------+-------------------+
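The 15 → 8 shift happens because Spark renders TimestampType values in the session time zone (spark.sql.session.timeZone). A minimal sketch, assuming Spark 2.4+ and reusing the df from the question, of pinning the session zone to UTC so the displayed value matches the input:

from pyspark.sql.functions import to_timestamp

# Render timestamps in UTC instead of the cluster's local zone;
# the stored instant is unchanged, only the display differs.
spark.conf.set("spark.sql.session.timeZone", "UTC")
df.withColumn("timestamp", to_timestamp("input_timestamp")) \
  .withColumn("timestamp_string", to_timestamp("input_timestamp").cast("string")) \
  .show(truncate=False)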
Alternatively, use to_utc_timestamp and pass your local time zone:

from pyspark.sql.functions import to_utc_timestamp

df = spark.createDataFrame(
    data=[('1', '2020-04-06 15:06:16 +00:00')],
    schema=['id', 'input_timestamp'])
df.printSchema()
df = df.withColumn('timestamp',
                   to_utc_timestamp('input_timestamp', your_local_timezone))
df.withColumn('timestamp_string', df.timestamp.cast('string')).show(truncate=False)

Replace your_local_timezone with the actual value.
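For instance, assuming the cluster's local zone were 'America/Los_Angeles' (a hypothetical value that happens to match the 7-hour shift in the question), the call would look like:

from pyspark.sql.functions import to_utc_timestamp

# 'America/Los_Angeles' is only an example zone; substitute the
# actual zone your cluster/session is running in.
df = df.withColumn('timestamp',
                   to_utc_timestamp('input_timestamp', 'America/Los_Angeles'))
df.withColumn('timestamp_string', df.timestamp.cast('string')).show(truncate=False)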
