PySpark regex match between tables

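I am trying to extract regex patterns from a column using PySpark. I have one DataFrame that contains the regex patterns and another table that contains the strings I want to match against.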

columns = ['id', 'text']
vals = [
 (1, 'here is a Match1'),
 (2, 'Do not match'),
 (3, 'Match2 is another example'),
 (4, 'Do not match'),
 (5, 'here is a Match1')
]
df_to_extract = sql.createDataFrame(vals, columns)

columns = ['id', 'Regex', 'Replacement']
vals = [
(1, 'Match1', 'Found1'),
(2, 'Match2', 'Found2'),
]
df_regex = sql.createDataFrame(vals, columns)

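I want to match the Regex column against the text column of df_to_extract. For each id whose text matches a pattern, I want to extract the term, so that the resulting table contains the id and the Replacement that corresponds to the matching Regex. For example: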

+---+------------+
| id| replacement|
+---+------------+
|  1|      Found1|
|  3|      Found2|
|  5|      Found1|
+---+------------+

Thanks!

One way is to use pyspark.sql.functions.expr, which lets you use column values as parameters in the join condition.

For example:

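from pyspark.sql.functions import expr

# Wrap the method chain in parentheses so it can span several lines.
(
    df_to_extract.alias("e")
    .join(
        df_regex.alias("r"),
        # Keep each (text, Regex) pair where text contains the Regex value.
        on=expr(r"e.text LIKE concat('%', r.Regex, '%')"),
        how="inner"
    )
    .select("e.id", "r.Replacement")
    .show()
)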
#+---+-----------+
#| id|Replacement|
#+---+-----------+
#|  1|     Found1|
#|  3|     Found2|
#|  5|     Found1|
#+---+-----------+

Here I used the SQL expression:

e.text LIKE concat('%', r.Regex, '%')

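This joins every pair of rows where the text column is LIKE the Regex column, with % acting as a wildcard that captures anything before and after. Note that LIKE does substring matching with SQL wildcards rather than regular-expression matching; that is sufficient here because the sample patterns are plain literals. For patterns that rely on regex syntax, Spark SQL's RLIKE operator could be used in the join condition instead.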
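To make the semantics of that join condition concrete, here is a minimal plain-Python sketch (no Spark involved) that emulates the join on the question's sample data. The helper names like_contains, regex_contains and join_on are hypothetical, and the LIKE emulation is a simplification that ignores LIKE escape characters. It shows that the LIKE-based condition and a true regex search agree on literal patterns but diverge once a pattern uses regex syntax.

```python
import re

# Sample data copied from the question.
texts = [
    (1, "here is a Match1"),
    (2, "Do not match"),
    (3, "Match2 is another example"),
    (4, "Do not match"),
    (5, "here is a Match1"),
]
patterns = [("Match1", "Found1"), ("Match2", "Found2")]


def like_contains(text, pattern):
    """Emulate `text LIKE concat('%', pattern, '%')` (simplified, no escapes).

    In a LIKE pattern, % matches any run of characters and _ matches any
    single character; every other character is taken literally.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    full = ".*" + "".join(parts) + ".*"
    return re.fullmatch(full, text, flags=re.DOTALL) is not None


def regex_contains(text, pattern):
    """Regex search anywhere in the text, as a regex-based condition would do."""
    return re.search(pattern, text) is not None


def join_on(predicate):
    """Nested-loop inner join keeping (id, replacement) where predicate holds."""
    return [
        (row_id, replacement)
        for row_id, text in texts
        for pattern, replacement in patterns
        if predicate(text, pattern)
    ]


like_result = join_on(like_contains)
regex_result = join_on(regex_contains)
print(like_result)   # [(1, 'Found1'), (3, 'Found2'), (5, 'Found1')]
print(regex_result)  # same rows, because the patterns are plain literals

# A pattern with regex syntax is where the two approaches differ.
print(like_contains("here is a Match1", r"Match\d"))   # False
print(regex_contains("here is a Match1", r"Match\d"))  # True
```

If the Regex column holds real regular expressions, the substring-style condition would silently miss rows like the last one, which is the case where a regex-aware condition is worth considering.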
