In python/pyspark, skip the split when a number follows a string, to get an array



Using python/pyspark, I want to create a new column from a string column whose values are separated by a space (" "), skipping the split when the token is followed by a number, and finally removing a trailing ";" if present:

Input:

"511 520 NA 611;"
"322 GA 620"  
"3 321;"
"334344"

Expected output:

Column            | new column
"511 520 NA 611;" | [511, 520, NA 611]
"322 GA 620"      | [322, GA 620]
"3 321;"          | [3, 321]
"334 344"         | [334, 344]

What I tried:

data = data.withColumn(
    "newcolumn",
    split(col("column"), r"\s"))

But the trailing ";" is left on the last array element, as shown here, and I want to remove it if it exists:

Column            | new column
"511 520 NA 611;" | [511, 520, NA, 611;]
"322 GA 620"      | [322, GA, 620]
"3 321;"          | [3, 321;]
"334 344"         | [334, 344]
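For reference, the same split can be reproduced with Python's re module outside Spark, which makes it easy to see why the ";" survives: it is not a whitespace character, so it stays glued to the last token (a plain-Python illustration, not the PySpark API):

```python
import re

# split(col, "\s") in Spark behaves like re.split on whitespace;
# ';' is not whitespace, so it remains attached to the last token
tokens = re.split(r"\s", "511 520 NA 611;")
print(tokens)  # ['511', '520', 'NA', '611;']
```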

You can use regexp_replace to remove the ";" at the end of the string first, and then perform the split. The regex ";$" matches a ";" at the end of the string.

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col, regexp_replace
spark = SparkSession.builder.getOrCreate()
data = [
("511 520 NA 611;",),
("322 GA 620",),
("3 321;",),
("334 344",)
]
df = spark.createDataFrame(data, ['column'])
df = df.withColumn("newcolumn", split(regexp_replace(col("column"), ';$', ''), r'\s'))
df.show(truncate=False)
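The two steps (strip the trailing ";", then split) can be sanity-checked in plain Python with the same regexes; this is only a sketch of the per-row logic the Spark expression applies:

```python
import re

def clean_and_split(s: str) -> list:
    # Mirror of regexp_replace(col, ';$', ''): drop a ';' at the end of the string
    s = re.sub(r";$", "", s)
    # Mirror of split(col, '\s'): split on single whitespace characters
    return re.split(r"\s", s)

print(clean_and_split("511 520 NA 611;"))  # ['511', '520', 'NA', '611']
print(clean_and_split("334 344"))          # ['334', '344']
```

Note that this only removes the ";"; it still splits "NA 611" into two elements, so keeping such pairs together needs the regexp_extract_all approach.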

As mentioned in the comments, you can use regexp_extract_all with the right regex, as follows:

from pyspark.sql import functions as F
data = [
["511 520 NA 611;"],
["322 GA 620"],
["3 321;"],
["334344"]
]
df = spark.createDataFrame(data, ["value"]) 
df.withColumn("extracted_value", F.expr(r"regexp_extract_all(value, '(\\d+)|(\\w+\\s\\d+)', 0)")).show()
# +---------------+------------------+
# |          value|   extracted_value|
# +---------------+------------------+
# |511 520 NA 611;|[511, 520, NA 611]|
# |     322 GA 620|     [322, GA 620]|
# |         3 321;|          [3, 321]|
# |         334344|          [334344]|
# +---------------+------------------+
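The alternation itself can be checked without a Spark session using Python's re module, whose \d, \w, and \s classes behave like Java's for this input; non-capturing groups are used so findall returns whole matches (a plain-Python check, not the Spark API):

```python
import re

# Same alternation as the Spark SQL pattern: a bare number, or word + space + number.
# The first alternative is tried first at each position, so "511" is consumed
# before "511 520" could be; "NA 611" falls through to the second alternative.
pattern = re.compile(r"(?:\d+)|(?:\w+\s\d+)")

for s in ["511 520 NA 611;", "322 GA 620", "3 321;", "334344"]:
    print(pattern.findall(s))
# ['511', '520', 'NA 611']
# ['322', 'GA 620']
# ['3', '321']
# ['334344']
```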
