Scala: Spark SQL to_date(unix_timestamp) returning NULL

Spark Version: spark-2.0.1-bin-hadoop2.7 Scala: 2.11.8

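I am loading a raw CSV into a DataFrame. Some columns in the CSV are meant to hold dates, but the values are written as 20161025 instead of 2016-10-25. The date_format parameter is a list of the names of the columns that need to be converted to yyyy-MM-dd format.

In the code below, I first load the CSV with the date columns typed as StringType via the schema. I then check whether date_format is non-empty, that is, whether any columns need converting from String to Date, and cast each of those columns using unix_timestamp and to_date. However, csv_df.show() returns rows that are all null.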

def read_csv(csv_source:String, delimiter:String, is_first_line_header:Boolean, 
    schema:StructType, date_format:List[String]): DataFrame = {
    println("|||| Reading CSV Input ||||")
    var csv_df = sqlContext.read
        .format("com.databricks.spark.csv")
        .schema(schema)
        .option("header", is_first_line_header)
        .option("delimiter", delimiter)
        .load(csv_source)
    println("|||| Successfully read CSV. Number of rows -> " + csv_df.count() + " ||||")
    if(date_format.length > 0) {
        for (i <- 0 until date_format.length) {
            csv_df = csv_df.select(to_date(unix_timestamp(
                csv_df(date_format(i)), "yyyy-MM-dd").cast("timestamp")))
            csv_df.show()
        }
    }
    csv_df
}

The first 20 rows returned:

+-------------------------------------------------------------------------+
|to_date(CAST(unix_timestamp(prom_price_date, YYYY-MM-DD) AS TIMESTAMP))|
+-------------------------------------------------------------------------+
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
+-------------------------------------------------------------------------+

Why am I getting all nulls?
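The likely cause can be reproduced without Spark. The following is a minimal sketch, assuming Spark 2.x parses the pattern with java.text.SimpleDateFormat semantics and returns null when parsing fails. The object name PatternMismatchDemo and the helper parse are hypothetical, introduced only for illustration.

```scala
import java.text.SimpleDateFormat
import scala.util.Try

// Hypothetical demo object, not part of the original code.
object PatternMismatchDemo {
  // Parse `raw` with `pattern`; None when the text does not match the pattern.
  def parse(raw: String, pattern: String): Option[java.util.Date] =
    Try(new SimpleDateFormat(pattern).parse(raw)).toOption

  def main(args: Array[String]): Unit = {
    val raw = "20161025"
    // The pattern expects dashes that the input does not contain.
    println(parse(raw, "yyyy-MM-dd").isDefined) // false
    // The pattern mirrors the layout of the input, so parsing succeeds.
    println(parse(raw, "yyyyMMdd").isDefined)   // true
  }
}
```

The pattern argument describes how the input string is laid out, not the layout wanted in the output.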

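The pattern passed to unix_timestamp must describe the input string. yyyy-MM-dd does not match 20161025, so every row fails to parse and becomes null. To convert yyyyMMdd to yyyy-MM-dd, you can use: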

spark.sql("""SELECT DATE_FORMAT(
  CAST(UNIX_TIMESTAMP('20161025', 'yyyyMMdd') AS TIMESTAMP), 'yyyy-MM-dd'
)""")

And with the DataFrame functions:

date_format(unix_timestamp(col, "yyyyMMdd").cast("timestamp"), "yyyy-MM-dd")
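Applied to the loop in read_csv, the same idea might look like the sketch below. The object DateColumns, the helper parseDateColumns, and its inputPattern parameter are hypothetical names, not part of the original code. Two choices here are assumptions: it uses withColumn rather than select, so the other columns are kept, and it uses to_date rather than date_format, so each column becomes a DateType instead of a formatted string.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, to_date, unix_timestamp}

// Hypothetical helper object, introduced for illustration only.
object DateColumns {
  // Replace each named string column (assumed to hold yyyyMMdd values)
  // with a DateType column of the same name, leaving other columns untouched.
  def parseDateColumns(df: DataFrame,
                       dateCols: Seq[String],
                       inputPattern: String = "yyyyMMdd"): DataFrame =
    dateCols.foldLeft(df) { (acc, name) =>
      // The pattern describes the layout of the input text.
      val parsed = unix_timestamp(col(name), inputPattern).cast("timestamp")
      acc.withColumn(name, to_date(parsed))
    }
}
```

Inside read_csv this would replace the for loop with something like `csv_df = DateColumns.parseDateColumns(csv_df, date_format)`. If a yyyy-MM-dd string column is wanted instead of a DateType, wrapping `parsed` in date_format(parsed, "yyyy-MM-dd") instead of to_date should give that.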
