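How do I find the position of one column's substring within another column using PySpark?

I have a PySpark DataFrame with two columns, text and subtext, where subtext is guaranteed to occur somewhere within text. How can I compute the position of subtext within the text column?

Input data: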

+---------------------------+---------+
|           text            | subtext | 
+---------------------------+---------+
| Where is my string?       | is      |
| Hm, this one is different | on      |
+---------------------------+---------+

Expected output:

+---------------------------+---------+----------+
|           text            | subtext | position |
+---------------------------+---------+----------+
| Where is my string?       | is      |       6  |
| Hm, this one is different | on      |       9  |
+---------------------------+---------+----------+

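Note: I can do this with static text or a static regex without any problem. What I have not found is any resource on doing it with row-specific text or a row-specific regex.

You can use locate. Subtract 1 because locate returns a 1-based position (and 0 when there is no match), while the expected output is 0-based.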

import pyspark.sql.functions as F
df2 = df.withColumn('position', F.expr('locate(subtext, text) - 1'))
df2.show(truncate=False)
+-------------------------+-------+--------+
|text                     |subtext|position|
+-------------------------+-------+--------+
|Where is my string?      |is     |6       |
|Hm, this one is different|on     |9       |
+-------------------------+-------+--------+
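
To make the index arithmetic concrete, here is a minimal pure-Python sketch that models locate's 1-based result with str.find. The helper name locate_one_based and the third "no match" row are hypothetical and only for illustration; this is not Spark code, and it only mirrors the documented behaviour for plain ASCII strings.

```python
def locate_one_based(substr: str, text: str) -> int:
    """Model of SQL locate: 1-based position of substr in text, 0 when absent."""
    # str.find is 0-based and returns -1 when absent, so adding 1 gives
    # a 1-based position and 0 for "not found".
    return text.find(substr) + 1


rows = [
    ("Where is my string?", "is"),
    ("Hm, this one is different", "on"),
    ("No match here", "xyz"),  # hypothetical row, not in the question's data
]

# Same arithmetic as the SQL expression: locate(subtext, text) - 1
positions = [locate_one_based(sub, text) - 1 for text, sub in rows]
print(positions)  # [6, 9, -1]
```

Note that a row without a match ends up as -1 after the subtraction, so if subtext is not actually guaranteed to occur, -1 marks the missing case.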

Another way, using the position SQL function:

from pyspark.sql.functions import expr
df1 = df.withColumn('position', expr("position(subtext in text) -1"))
df1.show(truncate=False)
#+-------------------------+-------+--------+
#|text                     |subtext|position|
#+-------------------------+-------+--------+
#|Where is my string?      |is     |6       |
#|Hm, this one is different|on     |9       |
#+-------------------------+-------+--------+

pyspark.sql.functions.instr(str, substr)

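Returns the 1-based position of the first occurrence of substr in the given string column, or 0 if it does not occur.

import pyspark.sql.functions as F
# Some PySpark versions only accept a literal string for substr in F.instr,
# so the SQL form is used here to let both arguments be columns.
df.withColumn('pos', F.expr('instr(text, subtext) - 1'))  # subtract 1 for a 0-based position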

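You can use locate itself. The catch is that the first argument of F.locate (substr) has to be a literal string rather than a column.

So you can use the expr function instead, which lets the SQL expression refer to both columns by name.

The working code looks like this:

# The third argument is the 1-based start position; subtract 1 from the result for a 0-based position.
df=input_df.withColumn("poss", F.expr("locate(subtext,text,1) - 1"))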
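
The third argument of locate is the 1-based position at which the search starts. A small pure-Python sketch of that behaviour follows, assuming the start position is inclusive and at least 1. The helper name locate_from is hypothetical, and this models the semantics rather than calling Spark.

```python
def locate_from(substr: str, text: str, pos: int = 1) -> int:
    """Model of SQL locate(substr, text, pos): 1-based result, 0 when absent.

    Assumes pos >= 1 and that the search includes position pos itself.
    """
    if pos < 1:
        raise ValueError("this sketch only models pos >= 1")
    # Convert the 1-based start to Python's 0-based start, then back again.
    return text.find(substr, pos - 1) + 1


text = "Hm, this one is different"
first = locate_from("is", text, 1)           # the "is" inside "this"
second = locate_from("is", text, first + 1)  # the standalone word "is"
print(first, second)  # 7 14
```

With pos left at 1 the result is the same as the two-argument form, which is why locate(subtext, text, 1) and locate(subtext, text) agree.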
