How to get the max value per group and keep all columns (the full record with the maximum for each group)



Given the following DataFrame:

+----+-----+---+-----+
| uid|    k|  v|count|
+----+-----+---+-----+
|   a|pref1|  b|  168|
|   a|pref3|  h|  168|
|   a|pref3|  t|   63|
|   a|pref3|  k|   84|
|   a|pref1|  e|   84|
|   a|pref2|  z|  105|
+----+-----+---+-----+

How can I get the maximum count for each (uid, k) group while still including v?

+----+-----+---+----------+
| uid|    k|  v|max(count)|
+----+-----+---+----------+
|   a|pref1|  b|       168|
|   a|pref3|  h|       168|
|   a|pref2|  z|       105|
+----+-----+---+----------+

I can do something like this, but it drops the "v" column:

df.groupBy("uid", "k").max("count")

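This is a perfect use case for window operators (using the over function) or for a join.

Since you have already figured out how to use windows, I will focus on the join: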

scala> val inventory = Seq(
     |   ("a", "pref1", "b", 168),
     |   ("a", "pref3", "h", 168),
     |   ("a", "pref3", "t",  63)).toDF("uid", "k", "v", "count")
inventory: org.apache.spark.sql.DataFrame = [uid: string, k: string ... 2 more fields]
scala> val maxCount = inventory.groupBy("uid", "k").max("count")
maxCount: org.apache.spark.sql.DataFrame = [uid: string, k: string ... 1 more field]
scala> maxCount.show
+---+-----+----------+
|uid|    k|max(count)|
+---+-----+----------+
|  a|pref3|       168|
|  a|pref1|       168|
+---+-----+----------+
scala> val maxCount = inventory.groupBy("uid", "k").agg(max("count") as "max")
maxCount: org.apache.spark.sql.DataFrame = [uid: string, k: string ... 1 more field]
scala> maxCount.show
+---+-----+---+
|uid|    k|max|
+---+-----+---+
|  a|pref3|168|
|  a|pref1|168|
+---+-----+---+
scala> maxCount.join(inventory, Seq("uid", "k")).where($"max" === $"count").show
+---+-----+---+---+-----+
|uid|    k|max|  v|count|
+---+-----+---+---+-----+
|  a|pref3|168|  h|  168|
|  a|pref1|168|  b|  168|
+---+-----+---+---+-----+
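
To make the semantics of the join approach concrete without a Spark session, here is a minimal plain-Python sketch (not Spark API) of what "aggregate, join back, filter on max === count" computes on the six rows from the question. The function name `rows_with_group_max` is hypothetical and only for illustration.

```python
from collections import defaultdict

# The six rows from the question: (uid, k, v, count)
rows = [
    ("a", "pref1", "b", 168),
    ("a", "pref3", "h", 168),
    ("a", "pref3", "t", 63),
    ("a", "pref3", "k", 84),
    ("a", "pref1", "e", 84),
    ("a", "pref2", "z", 105),
]

def rows_with_group_max(rows):
    """Plain-Python model of groupBy + max, then a join back on (uid, k)."""
    # Step 1: the aggregate, one maximum per (uid, k) key
    group_max = defaultdict(lambda: float("-inf"))
    for uid, k, _v, count in rows:
        group_max[(uid, k)] = max(group_max[(uid, k)], count)
    # Step 2: the join plus filter, keeping rows whose count equals the group maximum
    return [r for r in rows if r[3] == group_max[(r[0], r[1])]]

result = rows_with_group_max(rows)
for row in result:
    print(row)
```

Every row whose count equals its group's maximum survives, so if two rows in the same group tie for the maximum, both are returned.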

This is the best solution I have come up with so far:

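import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank}
val w = Window.partitionBy("uid", "k").orderBy(col("count").desc)
df.withColumn("rank", dense_rank().over(w)).where("rank == 1").select("uid", "k", "v", "count").show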
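
One point worth knowing about this approach is how it treats ties: dense_rank gives rank 1 to every row that shares the top count, so tied rows are all returned, whereas row_number would keep exactly one row per group. The plain-Python sketch below (not Spark API) models that difference. The data contains a hypothetical tie that is not in the question, and `top_rows` is an illustrative name.

```python
from itertools import groupby

# Hypothetical data: (uid, k, v, count), with a tie at 168 inside ("a", "pref1")
rows = [
    ("a", "pref1", "b", 168),
    ("a", "pref1", "x", 168),
    ("a", "pref1", "e", 84),
    ("a", "pref3", "h", 168),
]

def top_rows(rows, keep_ties):
    """Model of a rank == 1 filter per (uid, k), ordered by count descending."""
    key = lambda r: (r[0], r[1])
    kept = []
    for _, group in groupby(sorted(rows, key=key), key=key):
        ordered = sorted(group, key=lambda r: r[3], reverse=True)
        if keep_ties:
            # dense_rank-like: every row equal to the top count has rank 1
            kept.extend(r for r in ordered if r[3] == ordered[0][3])
        else:
            # row_number-like: exactly one row per group; Spark does not
            # guarantee which tied row wins, this model takes the first
            kept.append(ordered[0])
    return kept

with_ties = top_rows(rows, keep_ties=True)
one_per_group = top_rows(rows, keep_ties=False)
print(with_ties)
print(one_per_group)
```

If exactly one row per group is required, the ordering needs a tie-breaker column; otherwise the dense_rank version above is the one that matches the max-and-join results.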

You can use a window function:

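from pyspark.sql.functions import col, max as max_
from pyspark.sql.window import Window

# Attach each (uid, k) group's maximum count to every row of that group
w = Window.partitionBy("uid", "k")
with_max = df.withColumn("max_count", max_("count").over(w))
# Keep only the rows that hold the group maximum (tied rows are all kept)
with_max.where(col("count") == col("max_count")).drop("max_count").show()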
