Checking the datetime format of a column in a DataFrame



I have an input DataFrame that contains the following data:

id   date_column
1     2011-07-09 11:29:31+0000
2     2011-07-09T11:29:31+0000
3     2011-07-09T11:29:31
4     2011-07-09T11:29:31+0000

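I want to check whether each value in date_column matches the format "%Y-%m-%dT%H:%M:%S+0000". If it matches, I want to add a column with the value 1, otherwise 0. Currently, I have defined a UDF to do this: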

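from datetime import datetime

def date_pattern_matching(value, pattern):
    # Return "1" if value parses under pattern, otherwise "0".
    try:
        datetime.strptime(str(value), pattern)
        return "1"
    except ValueError:
        return "0"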

It produces the following output DataFrame:

id   date_column                       output
1     2011-07-09 11:29:31+0000           0
2     2011-07-09T11:29:31+0000           1
3     2011-07-09T11:29:31                0
4     2011-07-09T11:29:31+0000           1
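
For reference, the per-value check behind this table can be reproduced in plain Python without Spark. This is only a sketch: the `PATTERN` and `samples` names are introduced here for illustration, and the function is restated so the snippet runs on its own.

```python
from datetime import datetime

# Format from the question; "+0000" is treated as literal text by strptime.
PATTERN = "%Y-%m-%dT%H:%M:%S+0000"


def date_pattern_matching(value, pattern):
    """Return "1" if value parses under pattern, otherwise "0"."""
    try:
        datetime.strptime(str(value), pattern)
        return "1"
    except ValueError:
        return "0"


# The four date_column values from the input table (hypothetical variable name).
samples = [
    "2011-07-09 11:29:31+0000",
    "2011-07-09T11:29:31+0000",
    "2011-07-09T11:29:31",
    "2011-07-09T11:29:31+0000",
]

results = [date_pattern_matching(s, PATTERN) for s in samples]
print(results)  # ['0', '1', '0', '1']
```

When this function is wrapped in a Spark UDF, every row has to be handed to a Python worker for this call, which is a likely source of the slowness described.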

Running this through a UDF takes a lot of time. Is there another way to achieve it?

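Try a regular expression with the pyspark.sql.Column.rlike operator inside a when/otherwise block. Because rlike is a built-in column expression, it avoids the per-row Python overhead of a UDF.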

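from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [[1, "2011-07-09 11:29:31+0000"],
    [1, "2011-07-09 11:29:31+0000"],
    [2, "2011-07-09T11:29:31+0000"],
    [3, "2011-07-09T11:29:31"],
    [4, "2011-07-09T11:29:31+0000"]]
df = spark.createDataFrame(data, ["id", "date_column"])

# Anchored with ^ and $ so the whole string must match; [+-] is a literal sign
# before a 4-digit offset. To accept only "+0000", use \+0000 in its place.
regex = r"^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}[+-][0-9]{4}$"
df_w_output = df.select("*", F.when(F.col("date_column").rlike(regex), 1).otherwise(0).alias("output"))
df_w_output.show(truncate=False)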
Output
+---+------------------------+------+
|id |date_column             |output|
+---+------------------------+------+
|1  |2011-07-09 11:29:31+0000|0     |
|1  |2011-07-09 11:29:31+0000|0     |
|2  |2011-07-09T11:29:31+0000|1     |
|3  |2011-07-09T11:29:31     |0     |
|4  |2011-07-09T11:29:31+0000|1     |
+---+------------------------+------+
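
One design point worth checking before relying on a regex: it validates the shape of the string, not whether the date is real. The sketch below compares a shape check with a strptime parse in plain Python. It assumes an anchored pattern that accepts only the literal +0000 suffix from the question; `SHAPE`, `shape_ok`, `parses`, and `samples` are hypothetical names, and Python's `re` module stands in for the Java regex engine that Spark uses (the two agree for a pattern this simple).

```python
import re
from datetime import datetime

# Assumed pattern: four-digit year, two-digit fields, literal "T" and "+0000".
SHAPE = re.compile(r"^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}\+0000$")


def shape_ok(value):
    """1 if the string has the expected layout, else 0 (no calendar check)."""
    return 1 if SHAPE.search(value) else 0


def parses(value):
    """1 if strptime accepts the string as a real date and time, else 0."""
    try:
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%S+0000")
        return 1
    except ValueError:
        return 0


samples = [
    "2011-07-09T11:29:31+0000",  # valid shape, valid date
    "2011-07-09T11:29:31",       # missing offset
    "2011-13-45T11:29:31+0000",  # valid shape, impossible month and day
]

checks = [(s, shape_ok(s), parses(s)) for s in samples]
for row in checks:
    print(row)
# ('2011-07-09T11:29:31+0000', 1, 1)
# ('2011-07-09T11:29:31', 0, 0)
# ('2011-13-45T11:29:31+0000', 1, 0)
```

If impossible dates such as month 13 must be rejected, the regex would need tighter digit ranges or a follow-up parsing step; for a pure layout check, the shape test is enough.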
