How to filter rows that violate constraints



To run some unit tests on my data I am using PyDeequ. Is there a way to filter out the rows that violate the defined constraints? I couldn't find anything about this online. Here is my code:

df1 = (spark
    .read
    .format("csv")
    .option("header", "true")
    .option("encoding", "ISO-8859-1")
    .load("addresses.csv", sep=','))

check = Check(spark, CheckLevel.Warning, "Review Check")

checkResult = (VerificationSuite(spark)
    .onData(df1)
    .addCheck(
        check
        .isComplete("Nome")
        .isComplete("Citta")
        .isUnique("CAP")
        .isUnique("Number")
        .isContainedIn("Number", ("11", "12", "13", "14", "15", "16"))
    )
    .run())

checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show()

You should look at the rows of checkResult_df where constraint_status equals "Failure".

Building on the example above:

from pydeequ.checks import Check, CheckLevel, ConstrainableDataTypes
from pydeequ.verification import VerificationResult, VerificationSuite
from pyspark.sql import functions as F

df1 = (spark
    .read
    .format("csv")
    .option("header", "true")
    .option("encoding", "ISO-8859-1")
    .load("addresses.csv", sep=','))

check = Check(spark, CheckLevel.Warning, "Review Check")

checkResult = (VerificationSuite(spark)
    .onData(df1)
    .addCheck(
        check
        .isComplete("Nome")
        .isComplete("Citta")
        .isUnique("CAP")
        .isUnique("Number")
        .isContainedIn("Number", ("11", "12", "13", "14", "15", "16"))
    )
    .run())

checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)

# Added this snippet
# Filtering for any failed data quality constraints
df_checked_constraints_failures = (
    checkResult_df
    .filter(F.col("constraint_status") == "Failure"))
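
Note that checkResult_df only tells you which constraints failed and why (see its constraint_message column); it does not contain the offending rows themselves. If you also want those rows, one option is to re-express each constraint as a plain PySpark filter on df1. This is a sketch reusing the column names from the example above, not a PyDeequ API:

from pyspark.sql import Window

# Rows violating isComplete("Nome") / isComplete("Citta"): null values
incomplete_rows = df1.filter(F.col("Nome").isNull() | F.col("Citta").isNull())

# Rows violating isUnique("CAP"): CAP values that occur more than once
w = Window.partitionBy("CAP")
duplicate_cap_rows = (df1
    .withColumn("cap_count", F.count("*").over(w))
    .filter(F.col("cap_count") > 1)
    .drop("cap_count"))

# Rows violating isContainedIn("Number", ...): values outside the allowed set
invalid_number_rows = df1.filter(
    ~F.col("Number").isin("11", "12", "13", "14", "15", "16"))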

It may also help to alert on or log these failures:

import logging

logger = logging.getLogger(__name__)

# If any data quality check fails, log / raise an exception / alert Slack
if df_checked_constraints_failures.count() > 0:
    # DataFrame.show() prints to stdout and returns None, so collect the
    # failed constraints and log them instead of logging show()'s return value
    for row in df_checked_constraints_failures.collect():
        logger.info(row.asDict())
    # maybe raise an exception here
    # maybe send a POST message to a Slack webhook for the channel that monitors applications
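
For the "raise exception / POST to Slack" options in the comments above, a minimal sketch might look like the following (the webhook URL is a hypothetical placeholder, and only the standard library is used):

import json
import urllib.request

# Hypothetical placeholder for a Slack incoming-webhook URL
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

failed_rows = [row.asDict() for row in df_checked_constraints_failures.collect()]
if failed_rows:
    # Alert the channel that monitors applications...
    payload = {"text": f"Data quality checks failed: {failed_rows}"}
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)
    # ...and/or fail the pipeline outright
    raise ValueError(f"{len(failed_rows)} data quality constraint(s) failed")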
