I have a simple dataset, and I am trying to sort its rows by the FirstName column. I have used both orderBy and sort in Scala, but the output looks strange.
scala> val baseData = data.select($"Account.Number".as("AccountNumber"),
$"Account.FirstName".as("FirstName"),
$"Account.LastName".as("LastName"))
baseData: org.apache.spark.sql.DataFrame =
[AccountNumber: string, FirstName: string ... 1 more field]
scala> baseData.show(false)
+-------------+---------+--------+
|AccountNumber|FirstName|LastName|
+-------------+---------+--------+
|123-ABC-789  |Jay      |Smith   |
|456-DEF-456  |Sally    |Fuller  |
|333-XYZ-999  |Brad     |Turner  |
|987-CBA-321  |Justin   |Pihony  |
|123-ABC-789  |Jay      |Smith   |
|456-DEF-456  |Sally    |Fuller  |
|123-ABC-789  |Jay      |Smith   |
|456-DEF-456  |Sally    |Fuller  |
|333-XYZ-999  |Brad     |Turner  |
|333-XYZ-999  |Brad     |Turner  |
|333-XYZ-999  |Brad     |Turner  |
|987-CBA-321  |Justin   |Pihony  |
|123-ABC-789  |Jay      |Smith   |
|456-DEF-456  |Sally    |Fuller  |
|333-XYZ-999  |Brad     |Turner  |
|456-DEF-456  |Sally    |Fuller  |
|987-CBA-321  |Justin   |Pihony  |
|456-DEF-456  |Sally    |Fuller  |
|456-DEF-456  |Sally    |Fuller  |
|123-ABC-789  |Jay      |Smith   |
+-------------+---------+--------+
only showing top 20 rows
scala> baseData.sort($"FirstName").show(false)
+-------------+---------+--------+
|AccountNumber|FirstName|LastName|
+-------------+---------+--------+
|333-XYZ-999  |Brad     |Turner  |
|333-XYZ-999  |Brad     |Turner  |
|333-XYZ-999  |Brad     |Turner  |
|333-XYZ-999  |Brad     |Turner  |
|333-XYZ-999  |Brad     |Turner  |
|333-XYZ-999  |Brad     |Turner  |
|333-XYZ-999  |Brad     |Turner  |
|333-XYZ-999  |Brad     |Turner  |
|333-XYZ-999  |Brad     |Turner  |
|333-XYZ-999  |Brad     |Turner  |
|333-XYZ-999  |Brad     |Turner  |
|333-XYZ-999  |Brad     |Turner  |
|123-ABC-789  |Jay      |Smith   |
|123-ABC-789  |Jay      |Smith   |
|123-ABC-789  |Jay      |Smith   |
|123-ABC-789  |Jay      |Smith   |
|123-ABC-789  |Jay      |Smith   |
|123-ABC-789  |Jay      |Smith   |
|123-ABC-789  |Jay      |Smith   |
|123-ABC-789  |Jay      |Smith   |
+-------------+---------+--------+
only showing top 20 rows
I am getting repeated rows. I tried both sort and orderBy, and both produce the repeated rows.
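The sort is not creating these rows: baseData already contains repeated records, and sorting simply places them next to each other. To eliminate them, call .dropDuplicates, which keeps only the distinct records. Apply it before the sort, because deduplication shuffles the data and is not guaranteed to preserve an earlier ordering.
baseData.dropDuplicates().sort($"FirstName").show(false)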
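As a rough, self-contained sketch of the deduplication step: the object name DedupThenSort, the helper distinctSorted, the local SparkSession settings, and the sample rows are all assumptions made for illustration (in spark-shell, spark already exists). It removes exact duplicate rows first and sorts afterwards, so the ordering that is displayed is the one produced by the final sort.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DedupThenSort {
  // Hypothetical helper: drop exact duplicate rows, then sort by FirstName.
  def distinctSorted(df: DataFrame): DataFrame =
    df.dropDuplicates().sort(col("FirstName"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; adjust master/appName as needed.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dedup-then-sort")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample that mirrors the question's columns and repeats some rows.
    val baseData = Seq(
      ("123-ABC-789", "Jay", "Smith"),
      ("456-DEF-456", "Sally", "Fuller"),
      ("333-XYZ-999", "Brad", "Turner"),
      ("987-CBA-321", "Justin", "Pihony"),
      ("123-ABC-789", "Jay", "Smith"),
      ("333-XYZ-999", "Brad", "Turner")
    ).toDF("AccountNumber", "FirstName", "LastName")

    // Each distinct account appears once, ordered by first name.
    distinctSorted(baseData).show(false)

    spark.stop()
  }
}
```

If only some columns should decide what counts as a duplicate, dropDuplicates also accepts column names, for example dropDuplicates("AccountNumber").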
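To display every row of the DataFrame instead of only the default top 20, use the overload of the show method that takes the number of rows as its first argument, and pass it the DataFrame's count.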
// Deduplicate first, then sort, so that the final ordering is the one displayed.
baseData.dropDuplicates().sort($"FirstName").show(baseData.count().toInt, false)
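A minimal sketch of showing all rows, under the same assumptions as any standalone example (a local SparkSession and made-up sample rows). The helper name showAll is hypothetical; it takes the count from the DataFrame that is actually being displayed, so the row limit matches the deduplicated result rather than the original data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ShowAllRows {
  // Hypothetical helper: print every row of df without truncating cell values.
  // count() runs its own job, so this is only sensible for small results.
  def showAll(df: DataFrame): Unit =
    df.show(df.count().toInt, false)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("show-all-rows")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows with duplicates, using the question's column names.
    val baseData = Seq(
      ("123-ABC-789", "Jay", "Smith"),
      ("456-DEF-456", "Sally", "Fuller"),
      ("123-ABC-789", "Jay", "Smith"),
      ("333-XYZ-999", "Brad", "Turner")
    ).toDF("AccountNumber", "FirstName", "LastName")

    // Show all distinct rows, sorted by first name.
    showAll(baseData.dropDuplicates().sort(col("FirstName")))

    spark.stop()
  }
}
```

For large DataFrames, printing every row can be slow and can overwhelm the console, so a fixed limit is usually the safer choice.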