How do we enable strict data type checking for Spark DataFrames/Datasets? We receive many system-generated and manually produced feeds from upstream systems for transformation. The requirement is to ingest each feed and run a strict data type check against the expected schema before any transformation starts. Can anyone suggest how we can do this with Spark 2.0?
Here is what we have tried:
1. Using inferSchema = true while reading the file, then comparing the generated DataFrame's schema against the expected schema. However, inferSchema = true is a two-phase operation (an extra pass over the data just to infer types), which proves costly for a given file (a rough sketch is shown after option 2 below).
2. Enforcing the schema while creating the DataFrame from the CSV file:
val df: DataFrame = spark.read.format("csv")
  .schema(readSchemaFromAvroSchemaFile)
  .option("header", "true")
  .option("inferSchema", "false")
  .csv("CSVFileUri")
The strict data type check is not imposed while writing; it is applied only while reading the DataFrame. Is it possible to validate without making a read call, since that could be an expensive operation?
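For reference, option 1 roughly looks like the sketch below (variable and file names are placeholders taken from the snippet above, and I have not tuned it); the extra pass that inferSchema makes over the data is exactly the cost we are trying to avoid:

import org.apache.spark.sql.types.StructType

// Option 1 sketch: infer the schema (extra pass over the file), then compare it
// field by field against the expected schema read from the Avro definition.
val expectedSchema: StructType = readSchemaFromAvroSchemaFile
val inferredSchema: StructType = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("CSVFileUri")
  .schema

// Exact StructType equality is usually too strict because of nullability,
// so compare only field names and data types (zip assumes equal field counts).
val mismatches = expectedSchema.fields.zip(inferredSchema.fields).collect {
  case (exp, inf) if exp.name != inf.name || exp.dataType != inf.dataType => (exp, inf)
}
if (mismatches.nonEmpty) sys.error(s"Schema mismatch: ${mismatches.mkString(", ")}")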
Also, the double type shows some strange behavior. If we have the Avro schema:
{
  "namespace": "com.test.schema.validation",
  "name": "example",
  "type": "record",
  "fields": [
    {"name": "item_id", "type": ["null", "string"], "default": null},
    {"name": "item_price", "type": ["null", "double"], "default": null}
  ]
}
CSV file
item_id|item_price
1|234.90
2|634.90
3|534.90
4|233A40.90
5|233E12
df.show(10) gives me the following:
item_id|item_price
1|234.90
2|634.90
3|534.90
4|233.90
5|2.3E13
The value in row 4 is truncated without any failure, so it is hard to catch.
Please suggest if you have an efficient way to validate the schema.
Also, have you come across this kind of double value truncation?
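A possible workaround we are looking at (a sketch only, using the column names from the sample above, and not fully verified): read the numeric columns as plain strings, then cast them and flag values that are present but do not survive the cast to double:

import org.apache.spark.sql.functions.col

// Read everything as strings first so nothing is silently coerced,
// then cast and flag values that are present but not valid doubles.
val raw = spark.read
  .option("header", "true")
  .option("sep", "|")
  .csv("CSVFileUri")

val badPrices = raw.filter(
  col("item_price").isNotNull && col("item_price").cast("double").isNull
)
// Values like "233A40.90" should show up here, since Spark's string-to-double cast
// returns null for strings it cannot parse.
badPrices.show()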
I assume you are using Scala, so my suggestion would be to use a case class to define your schema. You could do something like the following:
import org.apache.spark.sql.Encoders
import spark.implicits._                    // needed for .as[Item]

// Mirror the Avro schema: item_id is a nullable string, item_price a nullable double
case class Item(item_id: Option[String], item_price: Option[Double])

val schema = Encoders.product[Item].schema  // StructType derived from the case class
val item = spark.read
  .schema(schema)
  .option("header", "true")
  .csv("path")
  .as[Item]
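One possible extension (a sketch on my side, not something I have verified against your exact file): the CSV reader also accepts a mode option, so combining the case-class schema with FAILFAST should make rows that cannot be converted to the declared types fail at read time instead of silently becoming null:

// Sketch only: fail the read on malformed rows rather than getting nulls back.
val strictItems = spark.read
  .schema(Encoders.product[Item].schema)
  .option("header", "true")
  .option("sep", "|")           // your sample file is pipe-delimited
  .option("mode", "FAILFAST")   // PERMISSIVE (default) nulls bad values, FAILFAST throws
  .csv("path")
  .as[Item]

Note that I have not checked whether FAILFAST catches the partially parsed double from row 4 of your sample.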
Let me know what you think.
I would also suggest reading this article from Databricks.