Truncating a string with PySpark

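I'm currently working in PySpark on Databricks, and I'm looking for a way to truncate a string the way Excel's RIGHT function does. For example, I'd like to change the ID column value 8841673_3 in a DataFrame to 8841673.

Does anyone know how I should proceed?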

A regular expression with regexp_extract:

from pyspark.sql.functions import regexp_extract
df = spark.createDataFrame([("8841673_3", )], ("id", ))
df.select(regexp_extract("id", r"^(\d+)_.*", 1)).show()
# +--------------------------------+
# |regexp_extract(id, ^(\d+)_.*, 1)|
# +--------------------------------+
# |                         8841673|
# +--------------------------------+

With regexp_replace:

from pyspark.sql.functions import regexp_replace
df.select(regexp_replace("id", "_.*$", "")).show()
# +--------------------------+
# |regexp_replace(id, _.*$, )|
# +--------------------------+
# |                   8841673|
# +--------------------------+

Or simply split:

from pyspark.sql.functions import split
df.select(split("id", "_")[0]).show()
# +---------------+
# |split(id, _)[0]|
# +---------------+
# |        8841673|
# +---------------+
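
The three approaches above can be sanity-checked locally without a cluster. The sketch below uses Python's built-in re module rather than Spark, on the assumption that these simple patterns behave the same way in Python as in the Java regex engine Spark uses. The helper names are made up for illustration.

```python
import re

def extract_prefix(value):
    # Mirrors regexp_extract(id, r"^(\d+)_.*", 1): empty string when nothing matches.
    match = re.search(r"^(\d+)_.*", value)
    return match.group(1) if match else ""

def strip_suffix(value):
    # Mirrors regexp_replace(id, "_.*$", ""): drop the first underscore and what follows.
    return re.sub(r"_.*$", "", value)

def first_field(value):
    # Mirrors split(id, "_")[0]: keep the text before the first underscore.
    return re.split("_", value)[0]

for sample in ["8841673_3", "8841673"]:
    print(sample, extract_prefix(sample), strip_suffix(sample), first_field(sample))
```

One difference worth keeping in mind: for an ID with no underscore, the regexp_extract pattern yields an empty string, while the regexp_replace and split versions return the ID unchanged.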

You can use the pyspark.sql.Column.substr method:

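import pyspark.sql.functions as F

def left(x, n):
    # substr positions are 1-based: take n characters starting at position 1
    return x.substr(1, n)

def right(x, n):
    # the start is a Column, so the length must be a Column too (hence F.lit)
    x_len = F.length(x)
    return x.substr(x_len - n + 1, F.lit(n))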
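
The index arithmetic for 1-based positions is easy to get wrong by one, so it can help to check it in plain Python first. This is only a rough model: the substr function below is a hypothetical stand-in for Column.substr, and it assumes 1 <= n <= len(s).

```python
def substr(s, pos, length):
    # Plain-Python stand-in for a 1-based substring: `length` characters from position `pos`.
    start = pos - 1
    return s[start:start + length]

def left(s, n):
    # The first n characters start at position 1.
    return substr(s, 1, n)

def right(s, n):
    # The last n characters start at position len(s) - n + 1.
    return substr(s, len(s) - n + 1, n)

print(left("8841673_3", 7))   # 8841673
print(right("8841673_3", 1))  # 3
```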
