Spark SQL如何通过聚合实现组



Spark SQL如何通过聚合实现组?我想根据姓名字段和最新数据进行分组,以获得最新的工资。如何编写SQL

数据为:

+-------+------|+---------|
// | name  |salary|date      |
// +-------+------|+---------|
// |AA     |  3000|2022-01   |
// |AA     |  4500|2022-02   |
// |BB     |  3500|2022-01   |
// |BB     |  4000|2022-02   |
// +-------+------+----------|

预期结果是:

+-------+------|
// | name  |salary|
// +-------+------|
// |AA     |  4500|
// |BB     |  4000|
// +-------+------+

假设数据帧注册为名为tmp的临时视图,首先按date的相反顺序为每个组(name)使用row_number窗口函数分配行号(rn),然后取所有具有rn=1的行。

sql = """
select name, salary from
(select *, row_number() over (partition by name order by date desc) as rn
from tmp)
where rn = 1
"""
df = spark.sql(sql)
df.show(truncate=False)
  1. 首先将字符串转换为日期
  2. 将日期隐藏到UNixTimestamp中。(日期的数字表示,因此可以使用Max)
  3. 用户";第一个";作为一个集合函数,用于检索聚合结果的值。(第一个结果,所以如果有日期平局,它可能会拉任何一个。)

:

simpleData = [("James","Sales","NY",90000,34,'2022-02-01'),
("Michael","Sales","NY",86000,56,'2022-02-01'),
("Robert","Sales","CA",81000,30,'2022-02-01'),
("Maria","Finance","CA",90000,24,'2022-02-01'),
("Raman","Finance","CA",99000,40,'2022-03-01'),
("Scott","Finance","NY",83000,36,'2022-04-01'),
("Jen","Finance","NY",79000,53,'2022-04-01'),
("Jeff","Marketing","CA",80000,25,'2022-04-01'),
("Kumar","Marketing","NY",91000,50,'2022-05-01')
]
schema = ["employee_name","name","state","salary","age","updated"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.printSchema()
df.show(truncate=False)
df.withColumn(
"dateUpdated", 
unix_timestamp(
to_date(
col("updated") ,
"yyyy-MM-dd"
)
) 
).groupBy("name")
.agg( 
max("dateUpdated"), 
first("salary").alias("Salary") 
).show()
+---------+----------------+------+
|     name|max(dateUpdated)|Salary|
+---------+----------------+------+
|    Sales|      1643691600| 90000|
|  Finance|      1648785600| 90000|
|Marketing|      1651377600| 80000|
+---------+----------------+------+

我通常的技巧是"zip";datesalary在一起(取决于你想先排序什么)

from pyspark.sql import functions as F
(df
.groupBy('name')
.agg(F.max(F.array('date', 'salary')).alias('max_date_salary'))
.withColumn('max_salary', F.col('max_date_salary')[1])
.show()
)
+----+---------------+----------+
|name|max_date_salary|max_salary|
+----+---------------+----------+
|  AA|[2022-02, 4500]|      4500|
|  BB|[2022-02, 4000]|      4000|
+----+---------------+----------+

相关内容

  • 没有找到相关文章

最新更新