PySpark: How to add a value to every element of an array column



I have a DataFrame in PySpark with an array column, and I want to add the number 1 to every element of each array. Here is the DF:

+--------------------+
|             growth2|
+--------------------+
|[0.041305445, 0.0...|
|[0.027677462, 0.0...|
|[-0.0027841541, 0...|
|[-0.003083522, 0....|
|[0.03309798, -0.0...|
|[-0.0030860472, 0...|
|[0.01870109, -0.0...|
|[0.0, 0.0, 0.0, 0...|
|[0.030841235, 0.0...|
|[-0.07487654, 0.0...|
|[-0.0030791108, 0...|
|[0.010564512, 0.0...|
|[0.017113779, 0.0...|
|[-0.0030568982, 0...|
|[0.8942986, 0.020...|
|[0.039178953, 0.0...|
|[-0.020131985, -0...|
|[0.09150412, -0.0...|
|[0.024969723, 0.0...|
|[0.017103601, -0....|
+--------------------+
only showing top 20 rows

Here is the first row:

Row(growth2=[0.041305445, 0.046466704, 0.16028039, 0.05724156, 0.03765997, 0.103110574, 0.031785928, 0.04724884, -0.028079592, 0.009382707, -0.25695816, 0.19432063, 0.061015617, 0.09409759, 0.12152613, 0.039392408, 0.989114, 0.04910219, 0.46904725, 0.0])

So the output should look like:

Row(growth2=[1.041305445, 1.046466704, 1.16028039, 1.05724156, 1.03765997, 1.103110574, 1.031785928, 1.04724884, -1.028079592, 1.009382707, -1.25695816, 1.19432063, 1.061015617, 1.09409759, 1.12152613, 1.039392408, 1.989114, 1.04910219, 1.46904725, 1.0])

Is there a PySpark function that can do this? I want to avoid writing a Pandas UDF, because with more than 50 million rows the operation would be slow compared to a native solution.

Spark provides higher-order functions for manipulating arrays natively:

import pyspark.sql.functions as f

# TRANSFORM applies the lambda `el -> el + 1` to every element of the array
df = df.withColumn('growth2', f.expr('TRANSFORM(growth2, el -> el + 1)'))
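If you are on Spark 3.1 or later, the same transformation can also be written with the pyspark.sql.functions.transform column function instead of a SQL expression string. Below is a minimal, self-contained sketch; the SparkSession setup and the small sample array are illustrative and not taken from the original data:

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative sample data: one row with a single array<double> column
df = spark.createDataFrame([([0.041305445, -0.028079592, 0.0],)], ['growth2'])

# SQL expression form (available since Spark 2.4)
df_expr = df.withColumn('growth2', f.expr('TRANSFORM(growth2, el -> el + 1)'))

# DataFrame API form (Spark 3.1+): the lambda receives each element as a Column
df_api = df.withColumn('growth2', f.transform('growth2', lambda el: el + 1))

df_api.show(truncate=False)

Both forms are evaluated natively by Spark's SQL engine, so there is no Python serialization overhead per row as there would be with a Pandas UDF.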
