I followed https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/ and started running checks, verifications, and so on.
However, I can't figure out exactly which rows of my data failed the checks. This is a very important part for me: I need the rows that did not pass.
I tried the following: https://github.com/awslabs/deequ/blob/master/src/test/scala/com/amazon/deequ/schema/RowLevelSchemaValidatorTest.scala but when I run the code from that link on Databricks I get errors:
error: object SparkContextSpec is not a member of package com.amazon.deequ
import com.amazon.deequ.SparkContextSpec
^
command-4342528364312961:24: error: not found: type SparkContextSpec
class RowLevelSchemaValidatorTest extends WordSpec with SparkContextSpec {
^
command-4342528364312961:28: error: not found: value withSparkSession
"correctly enforce null constraints" in withSparkSession { sparkSession =>
^
command-4342528364312961:39: error: not found: value RowLevelSchema
val schema = RowLevelSchema()
^
command-4342528364312961:40: error: not found: value isNullable
.withIntColumn("id", isNullable = false)
And the list of errors goes on.
Please help.
Thanks
The problem you are seeing is probably caused by an incorrect project setup. Are you running the tests from an IDE? If not, I suggest you make sure the code compiles (e.g., in IntelliJ); the unit tests should then be executable from there.
IntelliJ ships with a Maven plugin that lets you import the project.
import com.amazon.deequ.schema.{RowLevelSchema, RowLevelSchemaValidator}
import org.apache.spark.sql.types.{IntegerType, StringType, TimestampType}
// In a notebook, `spark` is the active SparkSession; its implicits provide .toDF
import spark.implicits._
// Sample data: only the first row satisfies every constraint below;
// the others have a non-numeric id, a null timestamp, or a null id
val data = Seq(
("123", "Product A", "2012-07-22 22:59:59"),
("N/A", "Product B", null),
("456", null, "2012-07-22 22:59:59"),
(null, "Product C", "2012-07-22 22:59:59")
).toDF("id", "name", "event_time")
// Row-level schema: id must be a non-null integer, name at most 10 chars,
// event_time a non-null timestamp in the given format
val schema = RowLevelSchema()
.withIntColumn("id", isNullable = false)
.withStringColumn("name", maxLength = Some(10))
.withTimestampColumn("event_time", mask = "yyyy-MM-dd HH:mm:ss", isNullable = false)
val result = RowLevelSchemaValidator.validate(data, schema)
assert(result.numValidRows == 2)
val validIds = result.validRows.select("id").collect.map { _.getInt(0) }.toSet
assert(validIds.size == result.numValidRows)
assert(validIds.contains(123))
assert(validIds.contains(456))
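Since the original question was about finding the rows that failed, note that the validation result should also expose the rejected rows, not just the conforming ones. The sketch below assumes the `RowLevelSchemaValidationResult` returned by `validate` carries `invalidRows` and `numInvalidRows` fields (as shown in the AWS blog post above); check your Deequ version if these names don't resolve:

```scala
// Hedged sketch: assumes the result of RowLevelSchemaValidator.validate
// exposes the rejected rows as `invalidRows` / `numInvalidRows`.
val result = RowLevelSchemaValidator.validate(data, schema)

// Rows that violated the schema, kept in their original string-typed form
// so you can inspect exactly what failed (e.g. id = "N/A", null event_time)
result.invalidRows.show(truncate = false)
assert(result.numInvalidRows == data.count() - result.numValidRows)
```

Writing `result.invalidRows` out to storage (or joining it back against the source) is usually more practical than the assertion-style checks from the unit test, which are only there to verify Deequ's own behavior.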