By default, when I load the data, every column is treated as a string. The data looks like this:
firstName,lastName,age,doj
dileep,gog,21,2016-01-01
avishek,ganguly,21,2016-01-02
shreyas,t,20,2016-01-03
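For context, a minimal sketch of how such a load might look in Spark 1.x, assuming the spark-csv package and a hypothetical people.csv path (not part of the original question); with no explicit schema, every column comes back as StringType:

// Hypothetical load, assuming the spark-csv package (Spark 1.x).
// Without an explicit schema, every column is read as StringType.
val raw = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("people.csv")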
After updating the schema of the RDD, it looks like this:
temp.printSchema
|-- firstName: string (nullable = true)
|-- lastName: string (nullable = true)
|-- age: string (nullable = true)
|-- doj: date (nullable = true)
I register a temp table and query against it:
temp.registerTempTable("temptable");
val temp1 = sqlContext.sql("select * from temptable")
temp1.show()
+---------+--------+---+----------+
|firstName|lastName|age| doj|
+---------+--------+---+----------+
| dileep| gog| 21|2016-01-01|
| avishek| ganguly| 21|2016-01-02|
| shreyas| t| 20|2016-01-03|
+---------+--------+---+----------+
val temp2 = sqlContext.sql("select * from temptable where doj > cast('2016-01-02' as date)")
But when I try to look at the result, it gives me:
temp2: org.apache.spark.sql.DataFrame = [firstName: string, lastName: string, age: string, doj: date]
And when I run
temp2.show()
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
So I tried your code and it works for me. I suspect the problem is in how you changed the schema in the first place, which doesn't look right to me (it is, of course, a little hard to read when posted in a comment; you should update the question with the code).
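For illustration only, since the schema-change code wasn't posted, here is a hypothetical way such a mismatch typically arises: applying a new StructType over the raw string rows with createDataFrame, which neither converts nor validates the values, so the error only surfaces when an action forces evaluation:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Hypothetical reconstruction of the failure mode: the schema declares new
// types, but the Row objects still hold the original strings.
val rows = sc.parallelize(Seq(Row("dileep", "gog", "21", "2016-01-01")))
val schema = StructType(Seq(
  StructField("firstName", StringType),
  StructField("lastName", StringType),
  StructField("age", IntegerType),  // schema says integer, row holds "21"
  StructField("doj", DateType)))    // schema says date, row holds a string
val broken = sqlContext.createDataFrame(rows, schema)  // succeeds: no validation
// broken.show()  // throws java.lang.ClassCastException:
//                // java.lang.String cannot be cast to java.lang.Integer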
Anyway, here is how I did it:
First, simulating your input:
val df = sc.parallelize(List(("dileep","gog","21","2016-01-01"), ("avishek","ganguly","21","2016-01-02"), ("shreyas","t","20","2016-01-03"))).toDF("firstName", "lastName", "age", "doj")
Then:
import org.apache.spark.sql.functions._
val temp = df.withColumn("doj", to_date('doj))
temp.registerTempTable("temptable");
val temp2 = sqlContext.sql("select * from temptable where doj > cast('2016-01-02' as date)")
Then temp2.show() works as expected and displays:
+---------+--------+---+----------+
|firstName|lastName|age| doj|
+---------+--------+---+----------+
| shreyas| t| 20|2016-01-03|
+---------+--------+---+----------+
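For completeness, a sketch of the same filter using the DataFrame API instead of SQL (temp2Alt is a name I made up; to_date, lit, and Column.cast are standard Spark functions):

import org.apache.spark.sql.functions._

// Equivalent filter without registering a temp table; compares the typed
// doj column against a string literal cast to date.
val temp2Alt = temp.filter('doj > lit("2016-01-02").cast("date"))
temp2Alt.show()  // same single-row result as above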