Adding a check based on a Compliance analyzer



This is the sample DataFrame (df) I'm working with:

+---+----+--------+
| id|orig|scrubbed|
+---+----+--------+
|  1|   a|       a|
|  2|   B|       b|
|  3|   c|       c|
|  4|   D|       d|
|  5|   *|      XX|
|  6|   $|      XX|
|  7|  ZZ|      ZZ|
|  8|  XX|      XX|
|  9|   y|       y|
| 10|   Z|       z|
+---+----+--------+

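I want to run a check that tells me whether the fraction of items that are "populated" after scrubbing (that is, not equal to "XX" or "ZZ") is at least 80%. (This check should fail.) I can add a Compliance analyzer to the VerificationRunBuilder to compute the metric, like this: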

val myVerificationResult: VerificationResult = new VerificationRunBuilder(df).
  addRequiredAnalyzer(
    Compliance(
      "populatedAfterScrubbing",
      "`scrubbed` NOT IN ('ZZ', 'XX') AND `scrubbed` IS NOT NULL",
      Some("`orig` NOT IN ('ZZ', 'XX') AND `orig` IS NOT NULL")
    )
  ).
  addCheck(
    Check(CheckLevel.Error, "Review Check").
      hasSize(_ >= 1)
  ).
  run()

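This code runs and successfully checks the data with the hasSize constraint, but I can't figure out how to add a constraint based on my custom Compliance analyzer. Is this possible?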

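I found a solution that seems to work, in case anyone is interested. The answer is to create a custom constraint instead of a custom analyzer. Here is the working code: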

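import com.amazon.deequ.{VerificationResult, VerificationSuite}
import com.amazon.deequ.VerificationResult.checkResultsAsDataFrame
import com.amazon.deequ.checks.{Check, CheckLevel}
import com.amazon.deequ.constraints.Constraint

// Constraint on the fraction of rows whose scrubbed value is populated,
// restricted to rows whose orig value was populated to begin with.
val myConstraint = Constraint.complianceConstraint(
  "my constraint",
  "`scrubbed` NOT IN ('ZZ', 'XX') AND `scrubbed` IS NOT NULL",
  (fraction: Double) => fraction >= 0.8,
  Some("`orig` NOT IN ('ZZ', 'XX') AND `orig` IS NOT NULL"),
  Some("no peeking")
)

val myVerificationResult: VerificationResult = VerificationSuite()
  .onData(df)
  .addCheck(
    Check(CheckLevel.Error, "Review Check")
      .addConstraint(myConstraint)
  )
  .run()

// Show one row per constraint with its status and message.
val result = checkResultsAsDataFrame(spark, myVerificationResult)
result.show(truncate = true)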

The result is exactly as expected:

+------------+-----------+------------+--------------------+-----------------+--------------------+
|       check|check_level|check_status|          constraint|constraint_status|  constraint_message|
+------------+-----------+------------+--------------------+-----------------+--------------------+
|Review Check|      Error|       Error|ComplianceConstra...|          Failure|Value: 0.75 does ...|
+------------+-----------+------------+--------------------+-----------------+--------------------+
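The 0.75 in the constraint message can be reproduced by hand. The following is a minimal plain-Scala sketch (no Spark or Deequ involved) of the arithmetic, assuming the where clause restricts which rows are counted, which is consistent with the value reported above. The names Record, isPopulated and complianceFraction are made up for this illustration and are not part of Deequ.

```scala
object ComplianceFractionSketch {
  // Hypothetical stand-in for one row of the sample DataFrame.
  final case class Record(id: Int, orig: String, scrubbed: String)

  // Data copied from the sample DataFrame in the question.
  val rows: Seq[Record] = Seq(
    Record(1, "a", "a"),   Record(2, "B", "b"),
    Record(3, "c", "c"),   Record(4, "D", "d"),
    Record(5, "*", "XX"),  Record(6, "$", "XX"),
    Record(7, "ZZ", "ZZ"), Record(8, "XX", "XX"),
    Record(9, "y", "y"),   Record(10, "Z", "z")
  )

  private val placeholders = Set("ZZ", "XX")

  // Mirrors "`col` NOT IN ('ZZ', 'XX') AND `col` IS NOT NULL".
  def isPopulated(value: String): Boolean =
    value != null && !placeholders.contains(value)

  // Fraction of rows with a populated scrubbed value, counted only among
  // rows whose orig value is populated (the assumed effect of the where clause).
  def complianceFraction(data: Seq[Record]): Double = {
    val considered = data.filter(r => isPopulated(r.orig))
    considered.count(r => isPopulated(r.scrubbed)).toDouble / considered.size
  }

  def main(args: Array[String]): Unit = {
    val fraction = complianceFraction(rows)
    // 8 rows pass the filter on orig, 6 of them have a populated scrubbed value.
    println(s"fraction = $fraction, passes = ${fraction >= 0.8}")
  }
}
```

Rows 7 and 8 are excluded by the filter on orig, and rows 5 and 6 are the two that fail the predicate, so the fraction is 6/8 and the assertion fraction >= 0.8 does not hold.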

Couldn't this be done with the built-in satisfies check (https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/checks/Check.scala#L667)? Something like:

Check(CheckLevel.Warning, "Satisfies TEST Constraint")
  .satisfies(
    "`scrubbed` NOT IN ('ZZ', 'XX') AND `scrubbed` IS NOT NULL",
    "my constraint",
    (fraction: Double) => fraction >= 0.8,
    Some("...")
  )

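I think this is available out of the box (OOB) rather than having to be defined through a compliance constraint, although the custom constraint is also a good idea if your logic is more involved.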
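One difference worth keeping in mind: the satisfies snippet, as written, has nothing corresponding to the filter on orig that the custom constraint uses. Assuming the predicate is then evaluated over every row, the fraction comes out differently. The plain-Scala sketch below (no Deequ involved; the helper names populated and fraction are hypothetical) compares the two cases on the sample data. Whether a filter can be attached to satisfies may depend on the Deequ version, so it is worth checking Check.scala for the version in use.

```scala
object UnfilteredFractionSketch {
  // (orig, scrubbed) pairs copied from the sample DataFrame in the question.
  val rows: Seq[(String, String)] = Seq(
    ("a", "a"), ("B", "b"), ("c", "c"), ("D", "d"), ("*", "XX"),
    ("$", "XX"), ("ZZ", "ZZ"), ("XX", "XX"), ("y", "y"), ("Z", "z")
  )

  // Mirrors "`col` NOT IN ('ZZ', 'XX') AND `col` IS NOT NULL".
  def populated(value: String): Boolean =
    value != null && value != "ZZ" && value != "XX"

  // Fraction of rows whose scrubbed value is populated, optionally counted
  // only among rows whose orig value is populated.
  def fraction(filterOnOrig: Boolean): Double = {
    val considered = if (filterOnOrig) rows.filter(r => populated(r._1)) else rows
    considered.count(r => populated(r._2)).toDouble / considered.size
  }

  def main(args: Array[String]): Unit = {
    println(s"with orig filter:    ${fraction(filterOnOrig = true)}")  // 6 of 8 rows
    println(s"without orig filter: ${fraction(filterOnOrig = false)}") // 6 of 10 rows
  }
}
```

Both values are below 0.8, so the check fails on this data either way, but on other data the two variants could disagree.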
