SQL-like subqueries in PySpark



I am trying to run this kind of query:

SELECT age, COUNT(age)
FROM T
GROUP BY age
HAVING COUNT(age) = (SELECT MIN(cnt)
                     FROM (SELECT COUNT(age) AS cnt FROM T GROUP BY age))
ORDER BY COUNT(age)

I tried

import pyspark.sql.functions as f

min_size = df.groupBy("age").count().select(f.min("count"))
df.groupBy("age").count().sort("count").filter(f.col("count") == min_size).show()

but I got AttributeError: 'DataFrame' object has no attribute '_get_object_id'

Is there any way to use subqueries in PySpark?

In your case, min_size is a DataFrame, not an integer, so the comparison in filter fails.
Try collecting it as a plain integer like this:

min_size = df.groupBy("age").count().select(f.min("count")).collect()[0][0]
