DataFrame SQL - Spark Scala orderBy does not give the correct order



I have a simple dataset and I am trying to sort its rows by the "FirstName" column. I used both orderBy and sort in Scala, but the output looks strange.

    scala> val baseData = data.select($"Account.Number".as("AccountNumber"),
             $"Account.FirstName".as("FirstName"),
             $"Account.LastName".as("LastName"))
    baseData: org.apache.spark.sql.DataFrame = 
             [AccountNumber: string, FirstName: string ... 1 more field]
    scala>  baseData.show(false)
    +-------------+---------+--------+
    |AccountNumber|FirstName|LastName|
    +-------------+---------+--------+
    |123-ABC-789  |Jay      |Smith   |
    |456-DEF-456  |Sally    |Fuller  |
    |333-XYZ-999  |Brad     |Turner  |
    |987-CBA-321  |Justin   |Pihony  |
    |123-ABC-789  |Jay      |Smith   |
    |456-DEF-456  |Sally    |Fuller  |
    |123-ABC-789  |Jay      |Smith   |
    |456-DEF-456  |Sally    |Fuller  |
    |333-XYZ-999  |Brad     |Turner  |
    |333-XYZ-999  |Brad     |Turner  |
    |333-XYZ-999  |Brad     |Turner  |
    |987-CBA-321  |Justin   |Pihony  |
    |123-ABC-789  |Jay      |Smith   |
    |456-DEF-456  |Sally    |Fuller  |
    |333-XYZ-999  |Brad     |Turner  |
    |456-DEF-456  |Sally    |Fuller  |
    |987-CBA-321  |Justin   |Pihony  |
    |456-DEF-456  |Sally    |Fuller  |
    |456-DEF-456  |Sally    |Fuller  |
    |123-ABC-789  |Jay      |Smith   |
    +-------------+---------+--------+
    only showing top 20 rows

    scala> baseData.sort($"FirstName").show(false)
    +-------------+---------+--------+
    |AccountNumber|FirstName|LastName|
    +-------------+---------+--------+
    |333-XYZ-999  |Brad     |Turner  |
    |333-XYZ-999  |Brad     |Turner  |
    |333-XYZ-999  |Brad     |Turner  |
    |333-XYZ-999  |Brad     |Turner  |
    |333-XYZ-999  |Brad     |Turner  |
    |333-XYZ-999  |Brad     |Turner  |
    |333-XYZ-999  |Brad     |Turner  |
    |333-XYZ-999  |Brad     |Turner  |
    |333-XYZ-999  |Brad     |Turner  |
    |333-XYZ-999  |Brad     |Turner  |
    |333-XYZ-999  |Brad     |Turner  |
    |333-XYZ-999  |Brad     |Turner  |
    |123-ABC-789  |Jay      |Smith   |
    |123-ABC-789  |Jay      |Smith   |
    |123-ABC-789  |Jay      |Smith   |
    |123-ABC-789  |Jay      |Smith   |
    |123-ABC-789  |Jay      |Smith   |
    |123-ABC-789  |Jay      |Smith   |
    |123-ABC-789  |Jay      |Smith   |
    |123-ABC-789  |Jay      |Smith   |
    +-------------+---------+--------+
    only showing top 20 rows

I get repeated rows. I tried both sort and orderBy, and both produce the repeated rows.

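The sort itself is working: the data contains many repeated rows, and show prints only the first 20 of them. To eliminate the duplicate rows, append .dropDuplicates to the end of the expression, so that only distinct records are shown.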

    baseData.sort($"FirstName").dropDuplicates.show(false)
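One caveat worth checking: dropDuplicates is executed as an aggregation with a shuffle, so an ordering applied before it is not guaranteed to survive. A minimal sketch that deduplicates first and sorts last is below. The local SparkSession, the sample rows, and the helper name `dedupThenSort` are assumptions made for illustration, not part of the original question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DedupThenSort {
  // Hypothetical helper: remove exact duplicate rows, then sort by FirstName.
  // Sorting is the last step so the final order is the sorted one.
  def dedupThenSort(df: DataFrame): DataFrame =
    df.dropDuplicates().orderBy(col("FirstName"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dedup-then-sort")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample standing in for baseData from the question.
    val baseData = Seq(
      ("123-ABC-789", "Jay", "Smith"),
      ("456-DEF-456", "Sally", "Fuller"),
      ("333-XYZ-999", "Brad", "Turner"),
      ("987-CBA-321", "Justin", "Pihony"),
      ("123-ABC-789", "Jay", "Smith"),
      ("333-XYZ-999", "Brad", "Turner")
    ).toDF("AccountNumber", "FirstName", "LastName")

    dedupThenSort(baseData).show(false)
    spark.stop()
  }
}
```

With the sample above, each of the four accounts should appear once, ordered by first name.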

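To display every row of the DataFrame regardless of its size, use the overload of show that takes the number of rows as its first argument, and pass it the DataFrame's count.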

    baseData.sort($"FirstName").dropDuplicates.show(baseData.count().toInt, false)
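The same idea can be wrapped in a small helper. This is only a sketch: `showAll` is a hypothetical name, and it counts the frame that is actually being shown (the line above counts baseData before deduplication, which is larger than needed but harmless because show prints at most that many rows). Note that count runs an extra job and that show brings the rows to the driver, so this is only reasonable for small results.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ShowAllRows {
  // Hypothetical helper: print every row of df without truncating cell values.
  def showAll(df: DataFrame): Unit = {
    val n = df.count() // Long; triggers a separate Spark job
    require(n <= Int.MaxValue, "too many rows to pass to show")
    df.show(n.toInt, false) // show(numRows: Int, truncate: Boolean)
  }

  def main(args: Array[String]): Unit = {
    // Local session and sample rows are assumptions for illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("show-all-rows")
      .getOrCreate()
    import spark.implicits._

    val baseData = Seq(
      ("123-ABC-789", "Jay", "Smith"),
      ("456-DEF-456", "Sally", "Fuller"),
      ("123-ABC-789", "Jay", "Smith")
    ).toDF("AccountNumber", "FirstName", "LastName")

    showAll(baseData.dropDuplicates().orderBy(col("FirstName")))
    spark.stop()
  }
}
```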
