I want to create a new column from a string column that is delimited by spaces, skipping the split where a word is followed by a number, and finally removing the trailing ";" if present, using Python/PySpark:
Input:
"511 520 NA 611;"
"322 GA 620"
"3 321;"
"334344"
Expected output:
Column            | new column
"511 520 NA 611;" | [511, 520, NA 611]
"322 GA 620"      | [322, GA 620]
"3 321;"          | [3, 321]
"334 344"         | [334, 344]
What I tried:
data = data.withColumn(
    "newcolumn",
    split(col("column"), r"\s"))
But the trailing ";" stays attached to the last element of the array, as shown here, and I want to remove it if present:
Column            | new column
"511 520 NA 611;" | [511, 520, NA, 611;]
"322 GA 620"      | [322, GA, 620]
"3 321;"          | [3, 321;]
"334 344"         | [334, 344]
You can use regexp_replace to strip the ";" at the end of the string first, and then perform the split. The regex ";$" matches a ";" at the end of the string.
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col, regexp_replace

spark = SparkSession.builder.getOrCreate()

data = [
    ("511 520 NA 611;",),
    ("322 GA 620",),
    ("3 321;",),
    ("334 344",)
]
df = spark.createDataFrame(data, ['column'])

# Strip a trailing ';' first, then split on whitespace
df = df.withColumn("newcolumn", split(regexp_replace(col("column"), ';$', ''), r"\s"))
df.show(truncate=False)
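Outside Spark, the same clean-up can be sanity-checked with Python's re module (a minimal plain-Python sketch, not part of the Spark code above):

```python
import re

def split_clean(s):
    # Mirror the Spark logic: drop a trailing ';', then split on whitespace
    return re.split(r"\s+", re.sub(r";$", "", s))

print(split_clean("3 321;"))      # ['3', '321']
print(split_clean("322 GA 620"))  # ['322', 'GA', '620']
```

Note that, like the Spark version above, this still splits "GA 620" into two tokens; keeping such word/number pairs together is what the regexp_extract_all approach below addresses.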
As mentioned in the comments, you can use regexp_extract_all with the right regex, as follows:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    ["511 520 NA 611;"],
    ["322 GA 620"],
    ["3 321;"],
    ["334344"]
]
df = spark.createDataFrame(data, ["value"])

# Extract either a standalone number or a word followed by a number
df.withColumn("extracted_value",
              F.expr(r"regexp_extract_all(value, '(\\d+)|(\\w+\\s\\d+)', 0)")).show()
# +---------------+------------------+
# | value| extracted_value|
# +---------------+------------------+
# |511 520 NA 611;|[511, 520, NA 611]|
# | 322 GA 620| [322, GA 620]|
# | 3 321;| [3, 321]|
# | 334344| [334344]|
# +---------------+------------------+
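The regex passed to regexp_extract_all uses only constructs (\d, \w, \s, alternation) that behave the same in Java and Python, so the pattern can be verified locally with Python's re module (a plain-Python sketch, not Spark code):

```python
import re

# Either a standalone run of digits, or a word followed by a number
pattern = re.compile(r"(?:\d+)|(?:\w+\s\d+)")

def extract(s):
    return pattern.findall(s)

print(extract("511 520 NA 611;"))  # ['511', '520', 'NA 611']
print(extract("3 321;"))           # ['3', '321']
```

Because alternation tries the left branch first, "3 321" yields two matches ('3' via \d+, then '321') rather than being captured as one "3 321" pair, which matches the expected output.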