How do I handle dates like "00000000" (yyyyMMdd) in a Scala DataFrame?



I am new to Scala. I have a test.csv file that I can read into a DataFrame, but the CSV has a DATE column containing values like "00000000". How can I handle such dates, i.e. replace "00000000" with null? Any help would be appreciated.

test.csv

TYPE,CODE,SQ_CODE,RE_TYPE,VERY_ID,IN_DATE,DATE
"F","000544","2017002","OP","95032015062763298","20150610","00000000"
"F","000544","2017002","LD","95032015062763261","20150611","20150519"
"F","000544","2017002","AK","95037854336743246","20150611","20150429"   
val sparkSession = SparkSession.builder().master("local").appName("IT_DATA").getOrCreate()

// Create the DataFrame
val df = sparkSession.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.option("delimiter", ",")
.option("dateFormat", "yyyyMMdd")
.load("/xx/xx/xx/xx/test.csv")
.cache()
df.printSchema
root
|-- TYPE: string (nullable = true)
|-- CODE: string (nullable = true)
|-- SQ_CODE: string (nullable = true)
|-- RE_TYPE: string (nullable = true)
|-- VERY_ID: long (nullable = true)
|-- IN_DATE: date (nullable = true)
|-- DATE: date (nullable = true)
df.show

Actual output

+-----+-------+---------+---------+-------------------+---------+-------------+
| TYPE|   CODE|  SQ_CODE| RE_TYPE |            VERY_ID|  IN_DATE|      DATE   |
+-----+-------+---------+---------+-------------------+---------+-------------+
|   F | 000544|  2017002|      OP |  95032015062763298| 20150610|   0002-11-30|
|   F | 000544|  2017002|      LD |  95032015062763261| 20150611|   2015-05-19|
|   F | 000544|  2017002|      AK |  95037854336743246| 20150611|   2015-04-29|
+-----+-------+---------+---------+-------------------+---------+-------------+

Expected output

+-----+-------+---------+---------+-------------------+---------+-------------+
| TYPE|   CODE|  SQ_CODE| RE_TYPE |            VERY_ID|  IN_DATE|      DATE   |
+-----+-------+---------+---------+-------------------+---------+-------------+
|   F | 000544|  2017002|      OP |  95032015062763298| 20150610|        null |
|   F | 000544|  2017002|      LD |  95032015062763261| 20150611|   2015-05-19|
|   F | 000544|  2017002|      AK |  95037854336743246| 20150611|   2015-04-29|
+-----+-------+---------+---------+-------------------+---------+-------------+

"00000000" is a valid date according to your date format, so Spark cannot treat it as a null value unless you tell it to.

You can specify this in the CSV reader with the nullValue option:

nullValue (default empty string): sets the string representation of a null value. Since 2.0.1, this applies to all supported types including the string type.
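For example, here is a sketch of the nullValue approach (assuming Spark 2.0.1 or later; the sample data is written to a temp file so the snippet runs as-is, standing in for your test.csv, and the fields are unquoted because nullValue matching can interact with quoting depending on the Spark version):

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("IT_DATA").getOrCreate()

// A tiny stand-in for test.csv, written to a temp file so the example runs as-is.
val path = Files.createTempFile("test", ".csv")
Files.write(path, "IN_DATE,DATE\n20150610,00000000\n20150611,20150519\n".getBytes)

// Cells whose content equals nullValue are loaded as null,
// so the date parser never sees "00000000".
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("dateFormat", "yyyyMMdd")
  .option("nullValue", "00000000")
  .csv(path.toString)

df.show()
```

With this, the "00000000" row loads with DATE = null instead of a bogus year-2 date.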

Alternatively, you can apply a transformation to the DataFrame to convert values back to null when they fall outside the valid range:

import sparkSession.implicits._               // needed for toDF
import org.apache.spark.sql.functions.when

val df = List((1, "0000"), (2, "12345")).toDF("id", "value")
df.withColumn("value", when(df("value") === "0000", null).otherwise(df("value"))).show

Since your date has already been parsed into some value, you can filter it out with a simple check like this:

import org.apache.spark.sql.functions.{when, year}

df.withColumn("date", when(year(df("date")) < 1900, null).otherwise(df("date"))).show
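If you would rather avoid parsing "00000000" into a bogus date at all, another option is a sketch like the following: read DATE as a plain string and convert it yourself with to_date, mapping the sentinel to null (to_date with an explicit format argument requires Spark 2.2+; the temp file stands in for your test.csv):

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date, when}

val spark = SparkSession.builder().master("local").appName("IT_DATA").getOrCreate()

// Stand-in for test.csv so the example runs as-is.
val path = Files.createTempFile("test", ".csv")
Files.write(path, "IN_DATE,DATE\n20150610,00000000\n20150611,20150519\n".getBytes)

// Read everything as strings (no inferSchema), then convert DATE manually,
// turning the "00000000" sentinel into null before any date parsing happens.
val raw = spark.read.option("header", "true").csv(path.toString)
val fixed = raw.withColumn(
  "DATE",
  when(col("DATE") === "00000000", null)
    .otherwise(to_date(col("DATE"), "yyyyMMdd"))
)
fixed.show()
```

This keeps the decision about what counts as "no date" explicit in your code instead of relying on how the CSV reader parses out-of-range values.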
