根据JSON定义验证csv



我是Scala的新手,想知道验证CSV文件的最佳方法是什么优选使用映射函数并添加新的列,这取决于条件是否满足。我想把它作为我在ApacheSpark中的数据帧的UDF函数。

这是模式:

Record Type  val1  val2  val3  
TYPE1         1    2    ZZ 
TYPE2         2   555   KK

我想验证的JSON定义是:

"rows" :
{ 
"TYPE1" :
"fields" : [
{
"required" : "true",
"regex":  "TYPE1",
},
{
"required" : true",
"regex" :"[a-zA-Z]{2}[a-zA-Z]{2}",
"allowed_values": null  
},
{
"required" : true",
"regex" :"[a-zA-Z]{2}[a-zA-Z]{2}",
"allowed_values" : ["ZZ","KK"]
}
]
}

我不确定你的JSON定义(它也缺少一些引号和大括号(,也不确定Record Type是否是CSV中的一列,但这里有一个简化——如果需要,你可以在它周围添加"Record Type"逻辑。

假设文件验证器.json:

{
"fields" : [
{
"name" : "val1",
"regex": "[0-9]+"
},{
"name" : "val2",
"regex" :"[0-9]+"
},{
"name" : "val3",
"regex" :"[A-Z]{2}"
}
]
}

通常,默认情况下(没有关于架构的额外选项(spark.read.format("csv").option("header", "true").load("file.csv")将对文件中的所有列使用Strings。这里,假设您有一个标题val1,val2,val3,作为CSV的第一行。等效定义的DF内联:

val df = Seq(("1", "2", "ZZ"), ("2", "555", "KK")).toDF("val1", "val2", "val3")
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.databind.ObjectMapper
import scala.io.Source
val mapper = new ObjectMapper
mapper.registerModule(DefaultScalaModule)
// read the validator as one long string
val jsonString = Source.fromFile("validator.json").getLines.mkString("")
// map the json string into an object (nested map)
val regexMap:Map[String,Seq[Map[String,String]]] = mapper.readValue(jsonString, classOf[Map[String, Seq[Map[String, String]]]])
//val1 rlike '[0-9]+' AND val2 rlike '[0-9]+' AND val3 rlike '[A-Z]{2}'
val exprStr:String = regexMap("fields").map((fieldDef:Map[String, String]) => s"${fieldDef("name")} rlike '${fieldDef("regex")}'").mkString(" AND ")
// this asks whether all rows match
val matchingRowCount:Long = df.filter(expr("val1 rlike '[0-9]+' AND val2 rlike '[0-9]+' AND val3 rlike '[A-Z][A-Z]'")).count
// if the counts match, then all of the rows follow the rules
df.count == matchingRowCount
// this adds a column about whether the row matches
df.withColumn("matches",expr(exprStr)).show

结果:

+----+----+----+-------+
|val1|val2|val3|matches|
+----+----+----+-------+
|   1|   2|  ZZ|   true|
|   2| 555|  KK|   true|
+----+----+----+-------+

最新更新