How to run a UDF on multiple columns of a PySpark DataFrame in parallel

I have a PySpark DataFrame with 4 columns:

1) Country
2) col1 [numeric]
3) col2 [numeric]
4) col3 [numeric]

I have a UDF that takes a number and formats it as xx.xx (two decimal places). Using the withColumn function, I can call the UDF and format the numbers.

Example:

df=df.withColumn("col1", num_udf(df.col1))
df=df.withColumn("col2", num_udf(df.col2))
df=df.withColumn("col3", num_udf(df.col3))

What I want is to run this UDF on every column in parallel instead of sequentially.

I'm not sure why you would want to run it in parallel, but you can do it with rdd and map:

temp = spark.createDataFrame(
    [(1, 2, 3)],
    schema=['col1', 'col2', 'col3']
)
temp.show(3, False)
+----+----+----+
|col1|col2|col3|
+----+----+----+
|1   |2   |3   |
+----+----+----+
# You can replace the "+ 1" in the lambda with your own function
temp = temp.rdd.map(
    lambda row: (row[0] + 1, row[1] + 1, row[2] + 1)
).toDF(['col1', 'col2', 'col3'])
temp.show(3, False)
+----+----+----+
|col1|col2|col3|
+----+----+----+
|2   |3   |4   |
+----+----+----+
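
The `+ 1` in the lambda above is only a placeholder. Below is a minimal sketch of a row-level function that could be passed to `rdd.map` instead, assuming the four-column layout from the question (Country first, then the numeric columns). The names `format_number` and `format_row` are made up for illustration. It is plain Python, so it can be checked on ordinary tuples before being handed to Spark:

```python
def format_number(x):
    """Format a number with two decimal places; pass None through."""
    return None if x is None else "%.2f" % x


def format_row(row):
    """Keep the first field (Country) and format the remaining numeric fields."""
    country, *numbers = row
    return (country, *[format_number(n) for n in numbers])


print(format_row(("US", 1, 2.5, None)))
# ('US', '1.00', '2.50', None)
```

With a DataFrame this would presumably be used as `df.rdd.map(format_row).toDF(df.columns)`. Each call handles one whole row, so all three columns are formatted in the same pass over the data.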

You can also create a UDF from a Python function like this:

from pyspark.sql.functions import udf
def formatNumber(x):
    # Format as a string with two decimal places; keep nulls as None
    if x is not None:
        return "%0.2f" % x
    else:
        return None
# No return type is given, so the UDF returns StringType by default
formatNumberUdf = udf(formatNumber)
df = df.withColumn("col1", formatNumberUdf('col1'))
df = df.withColumn("col2", formatNumberUdf('col2'))
df = df.withColumn("col3", formatNumberUdf('col3'))
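
The three `withColumn` calls can also be written as a single `select`. Spark evaluates transformations lazily and typically collapses chained `withColumn` calls into one projection, so they should not run as three separate passes over the data, and partitions are processed in parallel either way. The sketch below is one way to write this, not the only one; `format_number` and `format_columns` are hypothetical helper names, and `df` is assumed to be the DataFrame from the question:

```python
def format_number(x):
    """Format a number with two decimal places; pass None through."""
    return None if x is None else "%.2f" % x


def format_columns(df, columns):
    """Return df with every column named in `columns` formatted in one select."""
    # Imported here so format_number can be tried without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    format_number_udf = F.udf(format_number, StringType())
    formatted = [
        format_number_udf(F.col(c)).alias(c) if c in columns else F.col(c)
        for c in df.columns
    ]
    return df.select(*formatted)
```

It would be called as `df = format_columns(df, ["col1", "col2", "col3"])`, which leaves Country untouched and keeps the original column order.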
