Alteryx regex_countmatches equivalent in PySpark?



I'm migrating some Alteryx workflows to PySpark jobs, and in one of them I ran into the following filter condition.

length([acc_id]) = 9
AND 
(REGEX_CountMatches(right([acc_id],7),"[[:alpha:]]")=0 AND 
REGEX_CountMatches(left([acc_id],2),"[[:alpha:]]")=2)
OR
(REGEX_CountMatches(right([acc_id],7),"[[:alpha:]]")=0 AND 
REGEX_CountMatches(left([acc_id],1),"[[:alpha:]]")=1 AND 
REGEX_CountMatches(right(left([acc_id],2),1), '9')=1 
)

Can someone help me rewrite this condition for a PySpark DataFrame?

You can use length and regexp_replace to get the equivalent of Alteryx's REGEX_CountMatches function:

REGEX_CountMatches(right([acc_id],7),"[[:alpha:]]")=0 

becomes:

# replace all non-alphabetic characters with '' then get the length
F.length(F.regexp_replace(F.expr("right(acc_id, 7)"), '[^A-Za-z]', '')) == 0

The right and left functions are only available on the SQL side, but you can use them inside expr.
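Alternatively, substring accepts a negative start position for taking characters from the right, so you can express the same slicing without expr. A quick sketch:

from pyspark.sql import functions as F

left_2 = F.substring("acc_id", 1, 2)    # equivalent to left(acc_id, 2)
right_7 = F.substring("acc_id", -7, 7)  # equivalent to right(acc_id, 7)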

Full example:

from pyspark.sql import Column
from pyspark.sql import functions as F

df = spark.createDataFrame([("AB1234567",), ("AD234XG1234TT5",)], ["acc_id"])

def regex_count_matches(c: Column, regex: str) -> Column:
    """
    Helper function equivalent to REGEX_CountMatches: strip everything
    matching `regex`, then count the characters that remain. Pass the
    negated character class (e.g. '[^A-Za-z]') to count alpha characters.
    """
    return F.length(F.regexp_replace(c, regex, ''))

df.filter(
    (F.length("acc_id") == 9) &
    (
        (regex_count_matches(F.expr("right(acc_id, 7)"), '[^A-Za-z]') == 0)
        & (regex_count_matches(F.expr("left(acc_id, 2)"), '[^A-Za-z]') == 2)
    ) | (
        (regex_count_matches(F.expr("right(acc_id, 7)"), '[^A-Za-z]') == 0)
        & (regex_count_matches(F.expr("left(acc_id, 1)"), '[^A-Za-z]') == 1)
        & (regex_count_matches(F.expr("right(left(acc_id, 2), 1)"), '[^9]') == 1)
    )
).show()
#+---------+
#|   acc_id|
#+---------+
#|AB1234567|
#+---------+
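Side note: on a recent Spark release (3.5+, IIRC) there is a built-in regexp_count that maps to REGEX_CountMatches almost one-to-one, so the helper isn't needed. A sketch of the first condition (POSIX classes like [[:alpha:]] still aren't supported, so use [A-Za-z]):

from pyspark.sql import functions as F

# Spark 3.5+ only: count regex matches directly instead of replace-and-measure
alpha_in_right_7 = F.regexp_count(F.expr("right(acc_id, 7)"), F.lit("[A-Za-z]"))
df.filter((F.length("acc_id") == 9) & (alpha_in_right_7 == 0)).show()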

You can use size and split. You'll also need to use '[a-zA-Z]' as the regex, because expressions like "[[:alpha:]]" are not supported in Spark.

For example,

REGEX_CountMatches(right([acc_id],7),"[[:alpha:]]")=0

should be equivalent to (in Spark SQL):

size(split(right(acc_id, 7), '[a-zA-Z]')) - 1 = 0

You can put the Spark SQL string directly into the filter clause of a Spark DataFrame:

df2 = df.filter("size(split(right(acc_id, 7), '[a-zA-Z]')) - 1 = 0")
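Putting it together with the same sample data as the first answer, here is a sketch of just the first branch of the original condition, written entirely as a SQL filter string:

df2 = df.filter("""
    length(acc_id) = 9
    AND size(split(right(acc_id, 7), '[a-zA-Z]')) - 1 = 0
    AND size(split(left(acc_id, 2), '[a-zA-Z]')) - 1 = 2
""")
df2.show()
# expected:
#+---------+
#|   acc_id|
#+---------+
#|AB1234567|
#+---------+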
