Changing the data type of a column in a Spark RDD to date and querying it

By default, when I load the data, every column is treated as a string. The data looks like this:

firstName,lastName,age,doj
dileep,gog,21,2016-01-01
avishek,ganguly,21,2016-01-02
shreyas,t,20,2016-01-03
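
This kind of all-string DataFrame typically comes from loading the file as plain text and splitting on commas. A minimal sketch of that sort of loading code is below; the path and case class are made up for illustration, not taken from the question:

import sqlContext.implicits._

// Plain textFile + split: every field comes out as a String.
case class Person(firstName: String, lastName: String, age: String, doj: String)
val raw = sc.textFile("people.csv")          // hypothetical path
  .filter(!_.startsWith("firstName"))        // drop the header row
  .map(_.split(","))
  .map(a => Person(a(0), a(1), a(2), a(3)))
  .toDF()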

After updating the RDD's schema, it looks like this:

temp.printSchema
root
 |-- firstName: string (nullable = true)
 |-- lastName: string (nullable = true)
 |-- age: string (nullable = true)
 |-- doj: date (nullable = true)

I register a temp table and query it:

temp.registerTempTable("temptable");
val temp1 = sqlContext.sql("select * from temptable")
temp1.show()
+---------+--------+---+----------+
|firstName|lastName|age|       doj|
+---------+--------+---+----------+
|   dileep|     gog| 21|2016-01-01|
|  avishek| ganguly| 21|2016-01-02|
|  shreyas|       t| 20|2016-01-03|
+---------+--------+---+----------+
val temp2 = sqlContext.sql("select * from temptable where doj > cast('2016-01-02' as date)")

But when I try to see the result, it gives me:

temp2: org.apache.spark.sql.DataFrame = [firstName: string, lastName: string, age: string, doj: date]

And when I do:

temp2.show()
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer

So I tried your code and it works for me. I suspect the problem is in how you originally changed the schema, which doesn't look right to me (though admittedly it was a bit hard to read when you posted it in a comment; you should update the question with that code).
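
For reference, one common way to get a schema that prints doj: date yet fails only at show() time is to stamp a new StructType onto the existing rows without converting the values. DateType is stored internally as an integer (days since the epoch), which would be consistent with the String-to-Integer ClassCastException above. A minimal sketch of that anti-pattern (my guess at the original schema change, not the actual code from the question):

import org.apache.spark.sql.types._

// The new schema claims doj is a date, but the underlying rows still
// hold Strings; nothing here converts the values.
val schema = StructType(Seq(
  StructField("firstName", StringType, nullable = true),
  StructField("lastName",  StringType, nullable = true),
  StructField("age",       StringType, nullable = true),
  StructField("doj",       DateType,   nullable = true)
))

// printSchema on this reports doj: date, but any action that materializes
// the rows (like show()) throws a ClassCastException.
val broken = sqlContext.createDataFrame(df.rdd, schema)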

In any case, here's how I did it:

First, simulate your input:

import sqlContext.implicits._
val df = sc.parallelize(List(("dileep","gog","21","2016-01-01"), ("avishek","ganguly","21","2016-01-02"), ("shreyas","t","20","2016-01-03"))).toDF("firstName", "lastName", "age", "doj")

Then:

import org.apache.spark.sql.functions._
val temp = df.withColumn("doj", to_date('doj))
temp.registerTempTable("temptable");
val temp2 = sqlContext.sql("select * from temptable where doj > cast('2016-01-02' as date)")
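
If you prefer the DataFrame API over SQL, the same filter can be written without the temp table. A sketch equivalent to the query above (it assumes the implicits import is in scope, as in spark-shell):

import org.apache.spark.sql.functions._

// Compare the date column against a string literal cast to date.
val temp2 = temp.filter('doj > lit("2016-01-02").cast("date"))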

Running temp2.show() then displays the expected result:

+---------+--------+---+----------+
|firstName|lastName|age|       doj|
+---------+--------+---+----------+
|  shreyas|       t| 20|2016-01-03|
+---------+--------+---+----------+
