What is the Spark SQL equivalent of a Spark DataFrame's dropDuplicates?



I have some code in Spark (3.0/3.1) written like this:

foo.join(bar, Seq("col1","col2","col3"),"inner").dropDuplicates("col1","col2")

where foo and bar are two generic DataFrames.

How can I translate this to Spark SQL? I can't find an equivalent of dropDuplicates:

select distinct(col1, col2), * ....

seems to do something different. Any ideas?

You can use a ranking function - row_number(). It assigns a sequential number to each row within a (col1, col2) partition, so keeping only rn = 1 retains exactly one row per combination, which matches the dropDuplicates("col1", "col2") semantics:

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val df = List((1, 2, 3), (1, 2, 4), (1, 4, 6)).toDF("col1", "col2", "col3")
df.dropDuplicates("col1", "col2").show()
/*
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   2|   3|
|   1|   4|   6|
+----+----+----+*/
df.createOrReplaceTempView("table")
spark.sql(
"""
|select col1, col2, col3 from (
|   select *, row_number() over (partition by col1, col2 order by col1, col2) as rn from table)
|   where rn = 1
|""".stripMargin).show()
/*
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   2|   3|
|   1|   4|   6|
+----+----+----+*/
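
If you want to stay in pure SQL without a window function, a GROUP BY aggregation is a close alternative - this is a sketch, assuming the same temp view `table` as above. Note that with both dropDuplicates and row_number() (ordered only by the partition columns), which col3 value survives for a duplicated (col1, col2) pair is arbitrary; first() mirrors that, while min()/max() would make the choice deterministic.

```sql
-- Sketch: keep one row per (col1, col2) via aggregation.
-- first(col3) picks an arbitrary surviving value, mirroring dropDuplicates;
-- use min(col3) or max(col3) if you need a deterministic result.
select col1, col2, first(col3) as col3
from table
group by col1, col2
```

This only works cleanly when every non-key column can be wrapped in an aggregate; with many columns, the row_number() approach above is usually less verbose.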