I have a DataFrame with DateTime stored as a string type:
Row(Created Date=u'7/6/15 10:58', Closed Date=u'7/22/15 1:07')
I want to convert it to DateTime, so I tried the following. First, I dropped all rows with null values:
df = df.na.drop(subset=["Created Date"])
df = df.na.drop(subset=["Closed Date"])
Then I specified the format of the DateTime string:
func = F.udf(lambda x: datetime.strptime(x, '%m/%d/%y %H:%M'), DateType())
Then I applied the function to both columns, like so:
df = df.withColumn('Created Date', func(F.col('Created Date')))
df = df.withColumn('Closed Date', func(F.col('Closed Date')))
However, when I look at my DataFrame, the DateTime is in this format:
Row(Created Date=datetime.date(2015, 7, 6), Closed Date=datetime.date(2015, 7, 22))
It seems the hours and minutes have mysteriously disappeared. Am I parsing the datetime incorrectly, or is something else the culprit? Thanks!
There are two problems here.
- You are using DateType when you need TimestampType (that is why you get the unexpected result):

from datetime import datetime
from pyspark.sql import functions as F
from pyspark.sql.types import TimestampType

df = sc.parallelize([
    (u'7/6/15 10:58', '7/22/15 1:07')
]).toDF(['Created Date', 'Closed Date'])

as_timestamp = F.udf(
    lambda x: datetime.strptime(x, '%m/%d/%y %H:%M'),
    TimestampType()
)

df.select(
    as_timestamp("Created Date"),
    as_timestamp("Closed Date")
).show(1, False)
+----------------------+---------------------+
|<lambda>(Created Date)|<lambda>(Closed Date)|
+----------------------+---------------------+
|2015-07-06 10:58:00.0 |2015-07-22 01:07:00.0|
+----------------------+---------------------+
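The truncation described in the question can be reproduced with the standard library alone: a minimal sketch (plain Python, no Spark required) showing that a DateType column keeps only a datetime.date, so the time component parsed by strptime is discarded:

```python
from datetime import datetime

# Parse the question's string with the same format
dt = datetime.strptime('7/6/15 10:58', '%m/%d/%y %H:%M')
print(dt)        # 2015-07-06 10:58:00 -- time component is present

# DateType stores a datetime.date, which drops the time component
d = dt.date()
print(d)         # 2015-07-06 -- hours and minutes are gone
```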
- You are using a UDF where you could use native functions (this introduces a significant performance penalty):

def native_as_timestamp(colname, format='MM/dd/yy HH:mm'):
    """
    https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html
    """
    return F.unix_timestamp(
        colname, format
    ).cast(TimestampType()).alias(colname)

df.select(
    native_as_timestamp('Created Date'),
    native_as_timestamp('Closed Date')
).show(1, False)
+---------------------+---------------------+
|Created Date         |Closed Date          |
+---------------------+---------------------+
|2015-07-06 10:58:00.0|2015-07-22 01:07:00.0|
+---------------------+---------------------+
In recent versions (>= 2.2.0) you can replace unix_timestamp(...).cast(...) with to_timestamp:

df.select(
    F.to_timestamp('Created Date', 'MM/dd/yy HH:mm').alias('Created Date'),
    F.to_timestamp('Closed Date', 'MM/dd/yy HH:mm').alias('Closed Date')
).show(1, False)