I have a DataFrame with DateTime stored as a string type:
Row(Created Date=u'7/6/15 10:58', Closed Date=u'7/22/15 1:07')
I want to convert it to DateTime, so I tried the following. First, I dropped all rows with null values:
df = df.na.drop(subset=["Created Date"])
df = df.na.drop(subset=["Closed Date"])
Then I specified the format of the DateTime string:
func = F.udf(lambda x: datetime.strptime(x, '%m/%d/%y %H:%M'), DateType())
Then I applied the function to both columns, like so:
df = df.withColumn('Created Date', func(F.col('Created Date')))
df = df.withColumn('Closed Date', func(F.col('Closed Date')))
However, when I look at my DataFrame, the DateTime is in this format:
Row(Created Date=datetime.date(2015, 7, 6), Closed Date=datetime.date(2015, 7, 22))
It seems the hours and minutes have mysteriously disappeared. Am I parsing the datetime incorrectly, or is something else the culprit? Thanks!
There are two problems here.
- You are using DateType when you need TimestampType (that is why you get the unexpected result):

from datetime import datetime
from pyspark.sql import functions as F
from pyspark.sql.types import TimestampType

df = sc.parallelize([
    (u'7/6/15 10:58', '7/22/15 1:07')
]).toDF(['Created Date', 'Closed Date'])

as_timestamp = F.udf(
    lambda x: datetime.strptime(x, '%m/%d/%y %H:%M'),
    TimestampType()
)

df.select(
    as_timestamp("Created Date"),
    as_timestamp("Closed Date")
).show(1, False)
+----------------------+---------------------+
|<lambda>(Created Date)|<lambda>(Closed Date)|
+----------------------+---------------------+
|2015-07-06 10:58:00.0 |2015-07-22 01:07:00.0|
+----------------------+---------------------+
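The truncation described in the question can be reproduced with the standard library alone: a minimal sketch (plain Python, no Spark required) showing that a DateType column keeps only a datetime.date, so the time component parsed by strptime is discarded:

```python
from datetime import datetime

# Parse the question's string with the same format
dt = datetime.strptime('7/6/15 10:58', '%m/%d/%y %H:%M')
print(dt)        # 2015-07-06 10:58:00 -- time component is present

# DateType stores a datetime.date, which drops the time component
d = dt.date()
print(d)         # 2015-07-06 -- hours and minutes are gone
```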
- You are using a UDF where you could use native functions (this introduces a significant performance penalty):

def native_as_timestamp(colname, format='MM/dd/yy HH:mm'):
    """
    https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html
    """
    return F.unix_timestamp(
        colname, format
    ).cast(TimestampType()).alias(colname)

df.select(
    native_as_timestamp('Created Date'),
    native_as_timestamp('Closed Date')
).show(1, False)
+---------------------+---------------------+
|Created Date         |Closed Date          |
+---------------------+---------------------+
|2015-07-06 10:58:00.0|2015-07-22 01:07:00.0|
+---------------------+---------------------+
In recent versions (>= 2.2.0) you can replace unix_timestamp(...).cast(...) with to_timestamp:

df.select(
    F.to_timestamp('Created Date', 'MM/dd/yy HH:mm').alias('Created Date'),
    F.to_timestamp('Closed Date', 'MM/dd/yy HH:mm').alias('Closed Date')
).show(1, False)