I am trying to run the sample code for the pattern check hasPattern() with PyDeequ, and it fails with an exception.
Code:
import pydeequ
from pyspark.sql import SparkSession, Row

spark = (SparkSession
    .builder
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

df = spark.sparkContext.parallelize([
    Row(a="foo", creditCard="5130566665286573", email="foo@example.com", ssn="123-45-6789",
        URL="http://userid@example.com:8080"),
    Row(a="bar", creditCard="4532677117740914", email="bar@example.com", ssn="123456789",
        URL="http://example.com/(something)?after=parens"),
    Row(a="baz", creditCard="3401453245217421", email="foobar@baz.com", ssn="000-00-0000",
        URL="http://userid@example.com:8080")]).toDF()

from pydeequ.checks import *
from pydeequ.verification import *

check = Check(spark, CheckLevel.Error, "Integrity checks")

checkResult = (VerificationSuite(spark)
    .onData(df)
    .addCheck(
        check.hasPattern(column='email',
                         pattern=r".*@baz.com",
                         assertion=lambda x: x == 1 / 3))
    .run())

checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show()
After running this I get:
AttributeError: 'NoneType' object has no attribute '_Check'
at the line
check.hasPattern(column='email',
                 pattern=r".*@baz.com",
                 assertion=lambda x: x == 1 / 3)
PyDeequ version: 1.0.1; Python version: 3.7.9
At this point, the code in the pydeequ repository does not actually flesh out the function definition. It has a docstring describing the intended behavior, but there appears to be no accompanying code that does the actual work.
Without any code to perform the check, the function will always return None (the default return value of a Python function).
The correct, expected behavior of the check methods in pydeequ is to return the check object (represented by the self parameter), which lets users daisy-chain multiple checks in sequence.
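As a minimal sketch of that difference (plain Python, not PyDeequ code): a method whose body is only a docstring implicitly returns None, so anything downstream that expects a Check object, such as the chained addCheck(...).run() above, fails with an AttributeError on NoneType, while a method that ends with return self keeps the chain alive.

class StubCheck:
    def hasPattern(self, column, pattern):
        """Docstring only -- no body, so the call implicitly returns None."""

class FluentCheck:
    def hasPattern(self, column, pattern):
        """Ends with return self, so chaining keeps working."""
        return self

print(StubCheck().hasPattern("email", r".*@baz.com"))    # None
print(FluentCheck().hasPattern("email", r".*@baz.com"))  # <__main__.FluentCheck object at ...>

# Downstream code that then touches the result, e.g. its _Check attribute, fails:
# StubCheck().hasPattern("email", r".*@baz.com")._Check
# AttributeError: 'NoneType' object has no attribute '_Check'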
For comparison, here are code snippets of the hasPattern method (not fully coded, containing only the docstring) and of the containsCreditCardNumber method, which appears to be fully implemented.
hasPattern
def hasPattern(self, column, pattern, assertion=None, name=None, hint=None):
    """
    Checks for pattern compliance. Given a column name and a regular expression, defines a
    Check on the average compliance of the column's values to the regular expression.
    :param str column: Column in DataFrame to be checked
    :param Regex pattern: A name that summarizes the current check and the
            metrics for the analysis being done.
    :param lambda assertion: A function with an int or float parameter.
    :param str name: A name for the pattern constraint.
    :param str hint: A hint that states why a constraint could have failed.
    :return: hasPattern self: A Check object that runs the condition on the column.
    """
containsCreditCardNumber
def containsCreditCardNumber(self, column, assertion=None, hint=None):
    """
    Check to run against the compliance of a column against a Credit Card pattern.
    :param str column: Column in DataFrame to be checked. The column is expected to be a string type.
    :param lambda assertion: A function with an int or float parameter.
    :param hint hint: A hint that states why a constraint could have failed.
    :return: containsCreditCardNumber self: A Check object that runs the compliance on the column.
    """
    assertion = (
        ScalaFunction1(self._spark_session.sparkContext._gateway, assertion)
        if assertion
        else getattr(self._Check, "containsCreditCardNumber$default$2")()
    )
    hint = self._jvm.scala.Option.apply(hint)
    self._Check = self._Check.containsCreditCardNumber(column, assertion, hint)
    return self
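Because these implemented methods return self, they can be daisy-chained on a single Check object. A usage sketch on the DataFrame from the question (the assertion values are illustrative only, and it assumes your installed version also ships containsEmail alongside containsCreditCardNumber):

check = Check(spark, CheckLevel.Error, "Integrity checks")

checkResult = (VerificationSuite(spark)
    .onData(df)
    .addCheck(
        check.containsCreditCardNumber("creditCard", lambda x: x == 1.0)
             .containsEmail("email", lambda x: x == 1.0))
    .run())

VerificationResult.checkResultsAsDataFrame(spark, checkResult).show()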
I am still facing the same error, even though, following the link above, the method appears to have been implemented and merged into master. In fact, the implementation is:
def hasPattern(self, column, pattern, assertion=None, name=None, hint=None):
    """
    Checks for pattern compliance. Given a column name and a regular expression, defines a
    Check on the average compliance of the column's values to the regular expression.
    :param str column: Column in DataFrame to be checked
    :param Regex pattern: A name that summarizes the current check and the
            metrics for the analysis being done.
    :param lambda assertion: A function with an int or float parameter.
    :param str name: A name for the pattern constraint.
    :param str hint: A hint that states why a constraint could have failed.
    :return: hasPattern self: A Check object that runs the condition on the column.
    """
    assertion_func = (
        ScalaFunction1(self._spark_session.sparkContext._gateway, assertion)
        if assertion
        else getattr(self._Check, "hasPattern$default$2")()
    )
    name = self._jvm.scala.Option.apply(name)
    hint = self._jvm.scala.Option.apply(hint)
    pattern_regex = self._jvm.scala.util.matching.Regex(pattern, None)
    self._Check = self._Check.hasPattern(column, pattern_regex, assertion_func, name, hint)
    return self
But it is not included in 1.1.0; it will have to wait for another release.
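Until a release that includes it is published, one possible stop-gap, offered only as a sketch and assuming your installed package still ships the docstring-only stub, is to bind the merged implementation quoted above onto the Check class yourself at runtime (ScalaFunction1 is the Py4J lambda wrapper used by the other checks; it is assumed here to be importable from pydeequ.scala_utils):

from pydeequ.checks import Check
from pydeequ.scala_utils import ScalaFunction1  # assumed import path for the Py4J lambda wrapper

def _hasPattern(self, column, pattern, assertion=None, name=None, hint=None):
    # Same body as the merged implementation quoted above.
    assertion_func = (
        ScalaFunction1(self._spark_session.sparkContext._gateway, assertion)
        if assertion
        else getattr(self._Check, "hasPattern$default$2")()
    )
    name = self._jvm.scala.Option.apply(name)
    hint = self._jvm.scala.Option.apply(hint)
    pattern_regex = self._jvm.scala.util.matching.Regex(pattern, None)
    self._Check = self._Check.hasPattern(column, pattern_regex, assertion_func, name, hint)
    return self

# Patch the docstring-only stub for this session; drop this once an official
# release ships hasPattern.
Check.hasPattern = _hasPattern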