Hi everyone,
First, please know that I have already tried many of the solutions you would find on the first page of your favourite search engine. It concerns this error:
TypeError: field dt: TimestampType can not accept object '2021-05-01T09:19:46' in type <class 'str'>
My data is stored in an Amazon S3 bucket as raw.csv
and looks like:
2021-05-01T09:19:46,...
2021-05-01T09:19:42,...
2021-05-01T09:19:39,...
I have tried:
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp, unix_timestamp
from pyspark.sql.types import *

session = SparkSession.builder.getOrCreate()
df = GlueContext(SparkContext.getOrCreate()).create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={'paths': ["s3://bucket/to/raw.csv"]},
    format="csv",
    format_options={'withHeader': True},
).toDF()
events_schema = StructType([
    StructField("dt", TimestampType(), nullable=False),
    # and many other columns
])
df = session.createDataFrame(df.rdd, schema=events_schema)
df.withColumn("dt", to_timestamp("dt", "yyyy-MM-dd'T'HH:mm:ss")) \
    .show(1, False)
and
df.withColumn(
    "dt",
    unix_timestamp("dt", "yyyy-MM-dd'T'HH:mm:ss").cast("double").cast("timestamp"),
).show(1, False)
and I still get exactly the same error.
Try reading dt
as StringType first, then casting it to TimestampType
with df.withColumn.
Example:
events_schema = StructType([
    StructField("dt", StringType(), nullable=False),
    # and many other columns
])
df = session.createDataFrame(df.rdd, schema=events_schema)
df.show(10, False)
#+-------------------+
#|dt |
#+-------------------+
#|2021-05-01T09:19:46|
#+-------------------+
df.withColumn("dt", to_timestamp("dt", "yyyy-MM-dd'T'HH:mm:ss")).show()
#+-------------------+
#| dt|
#+-------------------+
#|2021-05-01 09:19:46|
#+-------------------+
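As a quick sanity check of the pattern itself (plain Python, outside the Glue job): the Java-style pattern yyyy-MM-dd'T'HH:mm:ss that to_timestamp uses corresponds to %Y-%m-%dT%H:%M:%S in Python's strptime, so you can confirm your raw values actually match it before debugging Spark.

```python
from datetime import datetime

# Sample value taken from the raw.csv excerpt above.
raw = "2021-05-01T09:19:46"

# %Y-%m-%dT%H:%M:%S mirrors the Spark pattern yyyy-MM-dd'T'HH:mm:ss:
# a literal 'T' separates the date and time components.
parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S")

print(parsed)  # 2021-05-01 09:19:46
```

If strptime raises a ValueError here, the data does not match the format string, and no amount of casting on the Spark side will fix it.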