I am trying to split the UTC offset found in timestamp_value into a new column named utc. I tried to do it with a Python regex, but I couldn't get it to work. Thanks for any answers!
This is what my DataFrame looks like:
+--------+----------------------------+
|machine |timestamp_value             |
+--------+----------------------------+
|1       |2022-01-06T07:47:37.319+0000|
|2       |2022-01-06T07:47:37.319+0000|
|3       |2022-01-06T07:47:37.319+0000|
+--------+----------------------------+
It should look like this:
+--------+-----------------------+-----+
|machine |timestamp_value        |utc  |
+--------+-----------------------+-----+
|1       |2022-01-06T07:47:37.319|+0000|
|2       |2022-01-06T07:47:37.319|+0000|
|3       |2022-01-06T07:47:37.319|+0000|
+--------+-----------------------+-----+
You can use regexp_extract
and regexp_replace
respectively to do this:
import pyspark.sql.functions as F

(df
  # Capture the offset; '+' must be escaped, since it is a regex quantifier
  .withColumn('utc', F.regexp_extract('timestamp_value', '(\\+.*)', 1))
  # Strip the offset from the original column
  .withColumn('timestamp_value', F.regexp_replace('timestamp_value', '\\+.*', ''))
).show(truncate=False)
+-------+-----------------------+-----+
|machine|timestamp_value        |utc  |
+-------+-----------------------+-----+
|1      |2022-01-06T07:47:37.319|+0000|
|2      |2022-01-06T07:47:37.319|+0000|
|3      |2022-01-06T07:47:37.319|+0000|
+-------+-----------------------+-----+
To better understand what the regex means, check out this tool.
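Since the question mentions trying plain Python regex first, the same pattern translates directly to the standard `re` module. A minimal sketch (the sample value is taken from the table above):

```python
import re

# A sample value in the same format as the timestamp_value column
ts = '2022-01-06T07:47:37.319+0000'

# Group 1 captures everything before the '+', group 2 the UTC offset.
# The '+' must be escaped because it is a regex quantifier.
match = re.match(r'(.*)(\+.*)', ts)
timestamp_value, utc = match.groups()

print(timestamp_value)  # 2022-01-06T07:47:37.319
print(utc)              # +0000
```

Note that the escaping rule is the same in Spark: an unescaped `+` raises a "dangling meta character" pattern error.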