PySpark salting: replace nulls in a column with random negative values

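I am joining on a number of columns that can sometimes contain billions of rows of null values, so I want to salt those columns to prevent skew after the join, as described in Jason Evan's post: https://stackoverflow.com/a/43394695

I can't find an equivalent example in Python, and the syntax is just different enough that I can't figure out how to translate it.

This is roughly what I have: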

import pyspark.sql.functions as psf
big_neg = -200
for column in key_fields: #key_fields is a list of join keys in the dataframe
    df = df.withColumn(column,
                       psf.when(psf.col(column).isNull(),
                                psf.round(psf.rand().multiply(big_neg))
                      ).otherwise(df[column]))

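This currently fails with the following error:

TypeError: 'Column' object is not callable

I have tried many syntax combinations to get rid of the TypeError and I am stumped.

I was actually able to figure it out after taking a break.

I figured this would be helpful for anyone else who runs into this problem, so I am posting my solution: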

df = df.withColumn(column, psf.when(df[column].isNull(), psf.round(psf.rand()*(big_neg))).otherwise(df[column]))
