Spark UDF for the max of multiple columns; TypeError: float() argument must be a string or a number, not 'Row'

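I am trying to get the maximum value across a list of columns, and also the name of the column that holds that maximum, as described in the posts "PySpark: compute row maximum of the subset of columns and add to an existing dataframe" and "How to get the name of column with maximum value in PySpark DataFrame". I have reviewed many posts and tried many options, but none of them worked.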

Other posts I looked at: "TypeError: 'Column' object is not callable using WithColumn" and "Pyspark: Pass multiple columns in UDF".

The table loaded into the dataframe has these columns: Rule_Total_Score: double, Rule_No_Identifier_Score: double

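from pyspark.sql import functions as f
from pyspark.sql.types import DoubleType

rules = ['Rule_Total_Score', 'Rule_No_Identifier_Score']
df = spark.sql('select * from table')

@f.udf(DoubleType())
def get_max_row_with_None(*cols):
    return float(max(x for x in cols if x is not None))

# Fails when evaluated: TypeError: float() argument must be a string or a number, not 'Row'
sdf = df.withColumn("max_rule", get_max_row_with_None(f.struct([df[col] for col in df.columns if col in rules])))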

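The UDF is declared with *cols, so it expects the columns as separate arguments, not a single struct column. With f.struct, the UDF receives one Row, max() returns that Row, and float() then raises the TypeError. If you pass the columns in directly and drop f.struct, it should hopefully work: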

@f.udf(DoubleType())
def get_max_row_with_None(*cols):
    # Return null when every input column is null
    if all(x is None for x in cols):
        return None
    return float(max(x for x in cols if x is not None))

sdf = df.withColumn(
    "max_rule",
    # Unpack the list so each column becomes its own UDF argument
    get_max_row_with_None(*[df[col] for col in df.columns if col in rules])
)
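
To see why the struct version fails, the UDF body can be run as plain Python without a Spark session. The sketch below is only an illustration: it assumes a namedtuple is a fair stand-in for pyspark.sql.Row (both are tuple subclasses), and get_max_col_name is a hypothetical helper, not part of the original answer, showing one way the name of the maximum column (the second half of the question) might be computed.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row (assumption: a namedtuple behaves closely
# enough to show the problem, since Row is also a tuple subclass).
Row = namedtuple("Row", ["Rule_Total_Score", "Rule_No_Identifier_Score"])


def get_max_row_with_None(*cols):
    """Same body as the UDF in the answer, without the Spark decorator."""
    if all(x is None for x in cols):
        return None
    return float(max(x for x in cols if x is not None))


def get_max_col_name(names, *cols):
    """Hypothetical helper: name of the column holding the row maximum."""
    pairs = [(value, name) for name, value in zip(names, cols) if value is not None]
    if not pairs:
        return None  # every column was null
    return max(pairs, key=lambda pair: pair[0])[1]


rules = ["Rule_Total_Score", "Rule_No_Identifier_Score"]

# f.struct(...) hands the function ONE argument: the whole row.
# max() over that single argument returns the row itself, and float() rejects it.
try:
    get_max_row_with_None(Row(3.0, 7.5))
    struct_error = None
except TypeError as exc:
    struct_error = exc

# Unpacking the columns hands the function one argument per column.
unpacked_max = get_max_row_with_None(3.0, 7.5)
max_name = get_max_col_name(rules, 3.0, 7.5)
```

In Spark, the helper could presumably be wrapped the same way as the max UDF, for example f.udf(lambda *cols: get_max_col_name(rules, *cols), StringType()), and called with the unpacked columns. If only the maximum value is needed, the built-in f.greatest(*[df[c] for c in rules]) may be a simpler option than a UDF; it skips nulls and returns null only when every input is null.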
