Comparing two values inside a column's StructField in PySpark



I have a column in which every row is a StructField. I want to get the maximum of two values inside that StructField.

I tried this:

trends_df = trends_df.withColumn("importance_score", max(col("avg_total")["max"]["agg_importance"], col("avg_total")["min"]["agg_importance"], key=max_key))

But it throws this error:

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
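The error comes from Python's builtin max rather than from Spark: max compares the two Column objects with >, which yields yet another Column, and then tries to coerce that Column to a bool to pick a winner, which Column refuses. A minimal reproduction, with made-up column names:

from pyspark.sql.functions import col

# builtin max() evaluates col("b") > col("a"), which is itself a Column,
# then calls bool() on it to decide which argument is larger;
# Column.__bool__ raises the ValueError shown above.
max(col("a"), col("b"))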

I'm currently doing it with a udf:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import FloatType

max_key = lambda x: x if x is not None else float("-inf")  # nulls sort below everything
_get_max_udf = udf(lambda x, y: max(x, y, key=max_key), FloatType())
trends_df = trends_df.withColumn("importance_score", _get_max_udf(col("avg_total")["max"]["agg_importance"], col("avg_total")["min"]["agg_importance"]))

This works, but I'd like to know whether there is a way to avoid the udf and do it with Spark's built-in functions alone.
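One udf-free way to spell this out by hand is a when/otherwise chain that replicates the null handling of the max_key lambda. A sketch under that assumption (the greatest answer further down is shorter):

from pyspark.sql.functions import col, when

a = col("avg_total")["max"]["agg_importance"]
b = col("avg_total")["min"]["agg_importance"]

# Null-safe two-column max: if one side is null, take the other;
# if both are null, the result stays null; otherwise compare.
trends_df = trends_df.withColumn(
    "importance_score",
    when(a.isNull(), b).when(b.isNull(), a).otherwise(when(a > b, a).otherwise(b)),
)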

Edit: here is the result of trends_df.printSchema():
root
|-- avg_total: struct (nullable = true)
|    |-- max: struct (nullable = true)
|    |    |-- avg_percent: double (nullable = true)
|    |    |-- max_index: long (nullable = true)
|    |    |-- max_val: long (nullable = true)
|    |    |-- total_percent: double (nullable = true)
|    |    |-- total_val: long (nullable = true)
|    |-- min: struct (nullable = true)
|    |    |-- avg_percent: double (nullable = true)
|    |    |-- min_index: long (nullable = true)
|    |    |-- min_val: long (nullable = true)
|    |    |-- total_percent: double (nullable = true)
|    |    |-- total_val: long (nullable = true)

Adding the answer from the comments here to highlight it.

As @smurphy answered, I used the greatest function:

from pyspark.sql.functions import col, greatest
trends_df = trends_df.withColumn("importance_score", greatest(col("avg_total")["max"]["agg_importance"], col("avg_total")["min"]["agg_importance"]))

https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.greatest
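For reference, a self-contained sketch with made-up flat columns standing in for the two struct fields. It shows the null behavior of greatest, which matches what the max_key lambda was emulating: nulls are skipped, and the result is null only when every input is null.

from pyspark.sql import SparkSession
from pyspark.sql.functions import greatest

spark = SparkSession.builder.getOrCreate()

# Hypothetical data covering the no-null, one-null, and all-null cases.
df = spark.createDataFrame(
    [(1.0, 2.5), (3.0, None), (None, None)],
    ["max_score", "min_score"],
)

df.withColumn("importance_score", greatest("max_score", "min_score")).show()
# Expected results per row:
#  (1.0, 2.5)   -> 2.5
#  (3.0, null)  -> 3.0
#  (null, null) -> null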
