Spark Scala DataFrame: timestamp conversion and sorting

I have a CSV in the following format:

t,value
2012-01-12 12:30:00,4
2012-01-12 12:45:00,3
2012-01-12 12:00:00,12
2012-01-12 12:15:00,13
2012-01-12 13:00:00,7

I load it into a DataFrame using spark-csv (so t is of type String and value is of type Integer). What is the appropriate Spark/Scala way to get the output sorted by time?

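I was thinking of converting t to some type that allows the DataFrame to be sorted, but I am not familiar with which timestamp type allows sorting a DataFrame by time.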

Given the format, you can either cast the column to a timestamp:

import org.apache.spark.sql.types.TimestampType
import sqlContext.implicits._ // provides the $"t" column syntax (already in scope in spark-shell)
df.select($"t".cast(TimestampType)) // or df.select($"t".cast("timestamp"))

to get a proper datetime, or use the unix_timestamp function (Spark 1.5+; in Spark < 1.5 you can use the Hive UDF of the same name):

import org.apache.spark.sql.functions.unix_timestamp
df.select(unix_timestamp($"t"))

to get a numeric representation (Unix timestamp in seconds).

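On a side note, there is no reason you couldn't orderBy($"t") directly. Lexicographic order should work just fine here, because the yyyy-MM-dd HH:mm:ss format is zero-padded and lists the most significant field first.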
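
The claim about lexicographic order can be checked without Spark. The following is a minimal sketch in plain Scala, using the sample rows from the question; the object name LexicographicOrderCheck and its fields are made up for illustration. It sorts the rows once by the raw string and once by the parsed java.sql.Timestamp and compares the two results.

```scala
import java.sql.Timestamp

// Hypothetical helper object, used only to illustrate the point above.
object LexicographicOrderCheck {
  // Sample rows copied from the CSV in the question.
  val rows: Seq[(String, Int)] = Seq(
    ("2012-01-12 12:30:00", 4),
    ("2012-01-12 12:45:00", 3),
    ("2012-01-12 12:00:00", 12),
    ("2012-01-12 12:15:00", 13),
    ("2012-01-12 13:00:00", 7)
  )

  // Sort by the raw string, i.e. lexicographic order.
  val byString: Seq[(String, Int)] = rows.sortBy(_._1)

  // Sort by the parsed timestamp, i.e. chronological order.
  val byTimestamp: Seq[(String, Int)] =
    rows.sortBy { case (t, _) => Timestamp.valueOf(t).getTime }

  def main(args: Array[String]): Unit = {
    byString.foreach(println)
    // Prints whether both orderings agree for this data.
    println(byString == byTimestamp)
  }
}
```

This only holds while every value follows the same zero-padded format; with mixed formats or unpadded fields, casting to a timestamp first is the safer choice.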

In addition to @zero323's answer, if you are writing pure SQL you can use the CAST operator, as follows:

df.registerTempTable("myTable")    
sqlContext.sql("SELECT CAST(t as timestamp) FROM myTable")

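If you cast with "df.select", you may get back only the specified column. To change the type of a specific column and keep the other columns, use "df.withColumn" and pass the original column name.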

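import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._
// Reusing the name "t" replaces that column in place, so "value" is preserved.
val df1 = df.withColumn("t", col("t").cast(TimestampType))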
df1.printSchema
root
 |-- t: timestamp (nullable = true)
 |-- value: integer (nullable = true)

Only the data type of column "t" is changed; the remaining columns are kept as they are.
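
Putting the cast and the sort together could look roughly like the sketch below. It assumes Spark 2.x or later (a SparkSession rather than the sqlContext used above), a local master, and an in-memory copy of the sample rows instead of the CSV file; the names SortByTimestamp and sortByTime are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.TimestampType

// Hypothetical helper object, not part of the answers above.
object SortByTimestamp {
  // Cast the string column "t" to a timestamp in place, then sort ascending.
  def sortByTime(df: DataFrame): DataFrame =
    df.withColumn("t", col("t").cast(TimestampType)).orderBy(col("t"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local SparkSession (Spark 2.x or later).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sort-by-timestamp")
      .getOrCreate()
    import spark.implicits._

    // In-memory stand-in for the CSV from the question.
    val df = Seq(
      ("2012-01-12 12:30:00", 4),
      ("2012-01-12 12:45:00", 3),
      ("2012-01-12 12:00:00", 12),
      ("2012-01-12 12:15:00", 13),
      ("2012-01-12 13:00:00", 7)
    ).toDF("t", "value")

    val sorted = sortByTime(df)
    sorted.printSchema()
    sorted.show(false)

    spark.stop()
  }
}
```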
