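I have a DataFrame with a column named "timestamp". The values look like this: 01/Jul/1995:00:00:01 -0400.
I want to convert them to the form 01/Jul/1995 and add an extra column holding the day of the week for that date (for example, Sat, Sun). How can I do that?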
Create an example DataFrame
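import datetime
from pyspark.sql import Row

# Assumes an active SparkSession named `spark` (as in the pyspark shell).
# A list keeps the rows in a fixed order.
df_rows = [
    Row(id=1, timestamp=datetime.datetime(2019, 12, 13, 9, 0)),
    Row(id=2, timestamp=datetime.datetime(2019, 12, 14, 9, 0)),
]
df = spark.createDataFrame(df_rows)
df.show()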
# +---+-------------------+
# | id|          timestamp|
# +---+-------------------+
# |  1|2019-12-13 09:00:00|
# |  2|2019-12-14 09:00:00|
# +---+-------------------+
Format the timestamp
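from pyspark.sql.functions import col, to_timestamp, date_format, dayofweek

df.select(
    col("timestamp"),
    # Day/month/year as a string
    date_format(col("timestamp"), "dd/MM/yyyy").alias("date"),
    # Day of the week as a number (1 = Sunday ... 7 = Saturday)
    dayofweek(col("timestamp")).alias("day_n"),
    # Day of the week as an abbreviated name
    date_format("timestamp", "E").alias("day_s"),
).show()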
# +-------------------+----------+-----+-----+
# |          timestamp|      date|day_n|day_s|
# +-------------------+----------+-----+-----+
# |2019-12-13 09:00:00|13/12/2019|    6|  Fri|
# |2019-12-14 09:00:00|14/12/2019|    7|  Sat|
# +-------------------+----------+-----+-----+
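Use PySpark's dayofweek (which returns the day of the week as a number, 1 = Sunday through 7 = Saturday) and date_format with the day-of-week pattern E (which returns the day of the week as a string).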
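To make the numbering concrete, here is a small plain-Python sketch (not part of the Spark solution) that parses the raw value from the question with the standard library and reproduces the 1 = Sunday through 7 = Saturday convention. The helper name spark_style_dayofweek is hypothetical and only illustrates the mapping.

```python
from datetime import datetime

# Raw value from the question; %b is the abbreviated month name, %z the numeric UTC offset.
raw = "01/Jul/1995:00:00:01 -0400"
parsed = datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")


def spark_style_dayofweek(dt):
    """Hypothetical helper: number the weekday as 1 = Sunday ... 7 = Saturday."""
    # isoweekday() is 1 = Monday ... 7 = Sunday, so shift by one and wrap around.
    return dt.isoweekday() % 7 + 1


day_n = spark_style_dayofweek(parsed)
day_s = parsed.strftime("%a")  # abbreviated weekday name in the default C locale
print(parsed.strftime("%d/%b/%Y"), day_n, day_s)
```

This is handy for checking a format string locally before running it against a full DataFrame.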
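I am assuming the DataFrame column already has the timestamp data type. If it holds strings instead, you first need to convert them to timestamps with to_timestamp.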