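How to count unique IDs after groupBy in PySpark

I'm using the following code to aggregate students per year. The goal is to know the total number of students for each year.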

from pyspark.sql.functions import col
import pyspark.sql.functions as fn
gr = Df2.groupby(['Year'])
df_grouped = gr.agg(fn.count(col('Student_ID')).alias('total_student_by_year'))

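The problem is that many IDs are repeated, so the result is wrong.

I want to aggregate the students by year and count the total number of students per year without counting duplicate IDs.

Use the countDistinct function: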

from pyspark.sql.functions import countDistinct
x = [("2001","id1"),("2002","id1"),("2002","id1"),("2001","id1"),("2001","id2"),("2001","id2"),("2002","id2")]
y = spark.createDataFrame(x,["year","id"])
gr = y.groupBy("year").agg(countDistinct("id"))
gr.show()

Output:

+----+------------------+
|year|count(DISTINCT id)|
+----+------------------+
|2002|                 2|
|2001|                 2|
+----+------------------+

You can also do:

y.groupBy("year", "id").count().groupBy("year").count()

This query returns the number of unique students per year.

countDistinct() and multiple aggregations are not supported in streaming.

If you are on an older Spark version that does not have the countDistinct function, you can replicate it by combining the size and collect_set functions:

gr = y.groupBy("year").agg(fn.size(fn.collect_set("id")).alias("distinct_count"))

If you need a distinct count over multiple columns, just use concat to join those columns into a new column and then proceed as above.

Using Spark/PySpark SQL:

y.createOrReplaceTempView("STUDENT")

spark.sql("SELECT year, count(DISTINCT id) as count " +
"FROM STUDENT group by year").show()
