How can I order by sum within a DataFrame in PySpark?

Something analogous to:

order_items.groupBy("order_item_order_id").count().orderBy(desc("count")).show()

I have tried:

order_items.groupBy("order_item_order_id").sum("order_item_subtotal").orderBy(desc("sum")).show()

But this gives an error:

Py4JJavaError: An error occurred while calling o501.sort. : org.apache.spark.sql.AnalysisException: cannot resolve 'sum' given input columns order_item_order_id, sum(order_item_subtotal#429);

I have also tried:

order_items.groupBy("order_item_order_id").sum("order_item_subtotal").orderBy(desc("SUM(order_item_subtotal)")).show()

But I get the same error:

Py4JJavaError: An error occurred while calling o512.sort. : org.apache.spark.sql.AnalysisException: cannot resolve 'SUM(order_item_subtotal)' given input columns order_item_order_id, SUM(order_item_subtotal#429);

I get the right result when executing:

order_items.groupBy("order_item_order_id").sum("order_item_subtotal").orderBy(desc("SUM(order_item_subtotal#429)")).show()

But this was done a posteriori, after having seen the number that Spark appends to the sum column name, i.e. #429.

Is there a way to get the same result a priori, without knowing which number will be appended?

You should give the aggregated column an alias, so that you control its name and can refer to it in orderBy:

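import pyspark.sql.functions as func

# Wrap the chain in parentheses so it can span several lines.
(order_items.groupBy("order_item_order_id")
            .agg(func.sum("order_item_subtotal")
                 .alias("sum_column_name"))  # a name you choose, no #429 suffix
            .orderBy(func.desc("sum_column_name")))  # descending, as in the question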
