How to apply strict data types to a Spark DataFrame/Dataset



How do we enforce strict data types on a Spark DataFrame/Dataset? We receive many system-generated and manually prepared feeds from upstream systems for transformation. The requirement is to ingest each feed and run a strict data type check against the schema before any transformation starts. Can someone suggest how to do this with Spark 2.0?
We have tried the following:

 1. Use inferSchema = true while reading the file and validate the generated DataFrame's schema against the expected schema. However, inferSchema = true is a two-pass operation and proves costly for a given file (a minimal comparison sketch follows the code block below).
 2. Enforce the schema while creating the DataFrame from the CSV file:

val df: DataFrame = spark.read.format("csv")
     .schema(readSchemaFromAvroSchemaFile)
     .option("header", "true")
     .option("sep", "|")              // the sample file below is pipe-delimited
     .option("inferSchema", "false")
     .csv("CSVFileUri")

The strict data type check is not imposed while writing; it is only applied while reading the DataFrame.
Is it possible to validate without making a read call, since that could be an expensive operation?
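
On the cost: constructing the DataFrame is lazy, so attaching the expected schema is essentially free; data is only read when an action runs, and that action can be limited to a sample. A minimal sketch, reusing the names from the snippet above (whether FAILFAST catches the specific double coercion described below depends on the Spark/parser version, so it is worth testing):

val strictDf = spark.read
  .schema(readSchemaFromAvroSchemaFile)
  .option("header", "true")
  .option("sep", "|")
  .option("mode", "FAILFAST")   // rows that fail to parse throw instead of silently becoming null
  .csv("CSVFileUri")

// only this action touches the data, and only a sample of it
strictDf.limit(100).collect()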
Also, the double type shows some strange behavior. If we have this Avro schema:

{
  "namespace":"com.test.schema.validation",
  "name" : "example",
  "type" : "record",
  "fields" [
    {"name":"item_id","type":["null","string"],"default":null},
    {"name":"item_price","type":["null","double"],"default":null}
   ] 
}
CSV file
item_id|item_price
    1|234.90 
    2|634.90
    3|534.90
    4|233A40.90
    5|233E12
df.show(10) gives me the following:
    item_id|item_price
    1|234.90 
    2|634.90
    3|534.90
    4|233.90 
    5|2.3E13
The value in Row #4 is truncated without any failure, so it is hard to catch.
Please suggest an efficient way to validate the schema.
Have you come across this kind of double value truncation?
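
For completeness, a minimal sketch of one way to surface such values with the same pipe-delimited file: read the columns as raw strings and flag values that do not cast cleanly to double.

import org.apache.spark.sql.functions.col

val raw = spark.read
  .option("header", "true")
  .option("sep", "|")
  .csv("CSVFileUri")   // with no schema and no inference, every column is read as a string

// values that are present but do not cast cleanly to double
val suspect = raw.filter(col("item_price").isNotNull && col("item_price").cast("double").isNull)
suspect.show()         // expected to flag values such as "233A40.90"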

I assume you are using Scala, so my suggestion would be to use a case class to define your schema. You can do something like the following:

// types aligned with the Avro schema from the question
case class Item(item_id: String, item_price: Double)

import spark.implicits._

val item = spark.read
  .schema(schema)            // the expected StructType, e.g. built from the Avro schema file
  .option("header", "true")
  .csv("path")
  .as[Item]

Let me know what you think.
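
As a follow-up note on the snippet above: if the expected schema should also live in the case class, the StructType passed to schema(...) can be derived from it so the two cannot drift apart. A minimal sketch (the Option fields mirror the nullable unions in the question's Avro schema; names are illustrative):

import org.apache.spark.sql.Encoders

case class Item(item_id: Option[String], item_price: Option[Double])

// StructType derived from the case class, usable as spark.read.schema(...)
val expectedSchema = Encoders.product[Item].schema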

I would also suggest reading this article from Databricks.
