Split a string into an array of characters in Spark



How can I split a string column into an array of characters?

Input:

from pyspark.sql import functions as F
df = spark.createDataFrame([('Vilnius',), ('Riga',), ('Tallinn',), ('New York',)], ['col_cities'])
df.show()
# +----------+
# |col_cities|
# +----------+
# |   Vilnius|
# |      Riga|
# |   Tallinn|
# |  New York|
# +----------+

Desired output:

# +----------+------------------------+
# |col_cities|split                   |
# +----------+------------------------+
# |Vilnius   |[V, i, l, n, i, u, s]   |
# |Riga      |[R, i, g, a]            |
# |Tallinn   |[T, a, l, l, i, n, n]   |
# |New York  |[N, e, w,  , Y, o, r, k]|
# +----------+------------------------+

You can use split with a regex pattern containing a negative lookahead. The pattern (?!$) matches the zero-width position before every character except at the end of the string, so the string is cut between all characters without producing a trailing empty element:

df.withColumn('split', F.split('col_cities', '(?!$)')).show(truncate=0)
# +----------+------------------------+
# |col_cities|split                   |
# +----------+------------------------+
# |Vilnius   |[V, i, l, n, i, u, s]   |
# |Riga      |[R, i, g, a]            |
# |Tallinn   |[T, a, l, l, i, n, n]   |
# |New York  |[N, e, w,  , Y, o, r, k]|
# +----------+------------------------+
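
As a quick usage sketch, the result is an ordinary array column, so the usual array functions apply; for example, you can explode it into one row per character (the char alias here is just for illustration):

# Explode the character array into one row per character.
result = df.withColumn('split', F.split('col_cities', '(?!$)'))
result.select('col_cities', F.explode('split').alias('char')).show()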

split also works if you provide the empty string '' as the pattern. However, it then returns an empty string as the last element of the array, so slice is needed to drop that final element.
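
To see the problem the slice works around, split on '' directly (the exact rendering of the trailing element may vary slightly by Spark version):

# With '' as the pattern, Spark applies the regex with limit -1,
# so a trailing empty string remains as the last array element,
# e.g. Riga becomes [R, i, g, a, ].
df.withColumn('split', F.split('col_cities', '')).show(truncate=0)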

split = "split(col_cities, '')"
split = F.expr(f'slice({split}, 1, size({split})-1)')
df.withColumn('split', split).show(truncate=0)
# +----------+------------------------+
# |col_cities|split                   |
# +----------+------------------------+
# |Vilnius   |[V, i, l, n, i, u, s]   |
# |Riga      |[R, i, g, a]            |
# |Tallinn   |[T, a, l, l, i, n, n]   |
# |New York  |[N, e, w,  , Y, o, r, k]|
# +----------+------------------------+
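
If you prefer to stay in the Column API instead of building an SQL expression string, the same slice can be written with F.slice, which accepts Column arguments for start and length on Spark 3.1+ (a sketch under that assumption):

# Column-API equivalent of the expr-based slice; requires Spark >= 3.1,
# where F.slice accepts Columns for start/length.
chars = F.split('col_cities', '')
df.withColumn('split', F.slice(chars, 1, F.size(chars) - 1)).show(truncate=0)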
