PySpark convert str to TimestampType



Hello everyone,

First, I need you to know that I have already tried many of the solutions you would find on the first page of your favorite search engine. They all concern this error:

TypeError: field dt: TimestampType can not accept object '2021-05-01T09:19:46' in type <class 'str'>

My data is stored as raw.csv in an Amazon S3 bucket and looks like:

2021-05-01T09:19:46,...
2021-05-01T09:19:42,...
2021-05-01T09:19:39,...
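
As an aside, I know that plain Spark's CSV reader can parse such a column at load time via the timestampFormat option; this is only a minimal sketch of that route, since my actual job goes through a Glue dynamic frame instead:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, TimestampType

spark = SparkSession.builder.getOrCreate()
events_schema = StructType([
    StructField("dt", TimestampType(), nullable=False),
    # and many other columns
])
# header and timestampFormat are standard CSV reader options
df = (spark.read
      .option("header", True)
      .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
      .schema(events_schema)
      .csv("s3://bucket/to/raw.csv"))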

I have tried:

from pyspark.context import SparkContext
from pyspark.sql.functions import to_timestamp, unix_timestamp
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())
session = glue_context.spark_session

# Load the raw CSV from S3 as a Glue dynamic frame, then convert to a DataFrame
df = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={'paths': ["s3://bucket/to/raw.csv"]},
    format="csv",
    format_options={'withHeader': True}
).toDF()

events_schema = StructType([
    StructField("dt", TimestampType(), nullable=False),
    # and many other columns
])
df = session.createDataFrame(df.rdd, schema=events_schema)

df.withColumn("dt", to_timestamp("dt", "yyyy-MM-dd'T'HH:mm:ss")) \
  .show(1, False)

df.withColumn("dt", unix_timestamp("dt", "yyyy-MM-dd'T'HH:mm:ss")
              .cast("double")
              .cast("timestamp")) \
  .show(1, False)

I still get exactly the same error.
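
From what I can tell, the TypeError is raised by createDataFrame itself: its row verifier rejects a Python str wherever the schema declares TimestampType, so the withColumn calls never even run. A minimal sketch that reproduces the error without Glue or S3 (the local session here is just for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, TimestampType

spark = SparkSession.builder.getOrCreate()
schema = StructType([StructField("dt", TimestampType(), nullable=False)])

# Raises: TypeError: field dt: TimestampType can not accept object
# '2021-05-01T09:19:46' in type <class 'str'>
spark.createDataFrame([("2021-05-01T09:19:46",)], schema=schema)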

Try reading dt as StringType first and then casting it to TimestampType with df.withColumn.

Example:

events_schema = StructType([
    StructField("dt", StringType(), nullable=False),
    # and many other columns
])
df = session.createDataFrame(df.rdd, schema=events_schema)
df.show(10, False)
#+-------------------+
#|dt                 |
#+-------------------+
#|2021-05-01T09:19:46|
#+-------------------+
df.withColumn("dt", to_timestamp("dt", "yyyy-MM-dd'T'HH:mm:ss")).show()
#+-------------------+
#|                 dt|
#+-------------------+
#|2021-05-01 09:19:46|
#+-------------------+
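
Since my real schema declares dt as nullable=False, my own follow-up sketch (building on the snippet above, with the imports and session from the first snippet) is to re-apply the strict schema after parsing; at that point the RDD rows hold datetime.datetime objects, which the TimestampType verifier accepts:

parsed = df.withColumn("dt", to_timestamp("dt", "yyyy-MM-dd'T'HH:mm:ss"))

strict_schema = StructType([
    StructField("dt", TimestampType(), nullable=False),
    # and many other columns
])
# Rows now carry datetime.datetime values, so schema verification passes
df = session.createDataFrame(parsed.rdd, schema=strict_schema)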

Latest update