How can I split a string column into an array of characters?
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame([('Vilnius',), ('Riga',), ('Tallinn',), ('New York',)], ['col_cities'])
df.show()
# +----------+
# |col_cities|
# +----------+
# | Vilnius|
# | Riga|
# | Tallinn|
# | New York|
# +----------+
Desired output:
# +----------+------------------------+
# |col_cities|split |
# +----------+------------------------+
# |Vilnius |[V, i, l, n, i, u, s] |
# |Riga |[R, i, g, a] |
# |Tallinn |[T, a, l, l, i, n, n] |
# |New York |[N, e, w, , Y, o, r, k]|
# +----------+------------------------+
You can use split with a regex pattern containing a negative lookahead. The pattern (?!$) matches at every zero-width position except the one at the end of the string, so the result has one element per character and no trailing empty string:
df.withColumn('split', F.split('col_cities', '(?!$)')).show(truncate=0)
# +----------+------------------------+
# |col_cities|split                   |
# +----------+------------------------+
# |Vilnius   |[V, i, l, n, i, u, s]   |
# |Riga      |[R, i, g, a]            |
# |Tallinn   |[T, a, l, l, i, n, n]   |
# |New York  |[N, e, w,  , Y, o, r, k]|
# +----------+------------------------+
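
Since the result is a genuine array<string> column, the usual array functions apply downstream. A small follow-up sketch (the chars and n_chars column names are arbitrary) that counts characters per city with F.size:

# Count characters per city by taking the size of the split array
df.withColumn('chars', F.split('col_cities', '(?!$)')) \
  .withColumn('n_chars', F.size('chars')) \
  .show(truncate=0)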
split can also be used with an empty string '' as the delimiter. However, it then returns an empty string as the last element of the array, so slice is needed to drop that last element:
# Build a SQL expression: split on '', then slice off the trailing '' element
split = "split(col_cities, '')"
split = F.expr(f'slice({split}, 1, size({split})-1)')
df.withColumn('split', split).show(truncate=0)
# +----------+------------------------+
# |col_cities|split |
# +----------+------------------------+
# |Vilnius |[V, i, l, n, i, u, s] |
# |Riga |[R, i, g, a] |
# |Tallinn |[T, a, l, l, i, n, n] |
# |New York |[N, e, w, , Y, o, r, k]|
# +----------+------------------------+
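
The same trim can also be expressed with the DataFrame API instead of a SQL expression string; a minimal sketch, assuming Spark 3.1+ where F.slice accepts Column arguments for start and length:

from pyspark.sql import functions as F

chars = F.split('col_cities', '')                      # ends with a trailing '' element
trimmed = F.slice(chars, F.lit(1), F.size(chars) - 1)  # keep everything but the last element
df.withColumn('split', trimmed).show(truncate=0)

This form avoids embedding column names in an expression string, at the cost of requiring the newer slice overload.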