DateType column is read as StringType from a CSV file even when an appropriate schema is provided

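I am trying to read a CSV file with PySpark that contains a date column in the format "dd/MM/yyyy". I have specified the field as DateType() in the schema definition and also passed the "dateFormat" option to the DataFrame CSV reader. However, the DataFrame produced by the read has the field as StringType() instead of DateType().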

Sample input data:

"school_id","gender","class","doj"
"1","M","9","01/01/2020" 
"1","M","10","01/03/2018"
"1","F","10","01/04/2018"
"2","M","9","01/01/2019"
"2","F","10","01/01/2018"

My code:

from pyspark.sql.types import StructField, StructType, StringType, DateType

school_students_schema = StructType([
    StructField("school_id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("class", StringType(), True),
    StructField("doj", DateType(), True),
])

school_students_df = spark.read.format("csv") \
    .option("header", True) \
    .option("schema", school_students_schema) \
    .option("dateFormat", "dd/MM/yyyy") \
    .load("/user/test/school_students.csv")

school_students_df.printSchema()

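Actual output after running the above (column doj is parsed as a string instead of the specified DateType with the given dateFormat, and no exception is raised):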

root
|-- school_id: string (nullable = true)
|-- gender: string (nullable = true)
|-- class: string (nullable = true)
|-- doj: string (nullable = true)

Expected output:

root
|-- school_id: string (nullable = true)
|-- gender: string (nullable = true)
|-- class: string (nullable = true)
|-- doj: date (nullable = true)

Runtime environment:

Databricks Community Edition
7.3 LTS (includes Apache Spark 3.0.1, Scala 2.12)

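I would appreciate help understanding:

Why the column is parsed as StringType even though DateType is specified in the schema
What needs to change in the code so that column doj is parsed as DateType()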

You should use

.schema(school_students_schema)

instead of

.option("schema", school_students_schema)

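(There is no "schema" key in the list of available options, so it is silently ignored and every column is read as a string by default.)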

You also need

.option("dateFormat", "some format")

or data that already matches the default format. If the format does not match, the column ends up as string type.

By the way, only one date format can be passed this way. Anything else has to be handled in row-level operations after the read.
