Spark SQL如何通过聚合实现组?我想根据姓名字段和最新数据进行分组,以获得最新的工资。如何编写SQL
数据为:
+-------+------|+---------|
// | name |salary|date |
// +-------+------|+---------|
// |AA | 3000|2022-01 |
// |AA | 4500|2022-02 |
// |BB | 3500|2022-01 |
// |BB | 4000|2022-02 |
// +-------+------+----------|
预期结果是:
+-------+------|
// | name |salary|
// +-------+------|
// |AA | 4500|
// |BB | 4000|
// +-------+------+
假设数据帧注册为名为tmp
的临时视图,首先按date
的相反顺序为每个组(name
)使用row_number
窗口函数分配行号(rn
),然后取所有具有rn=1
的行。
sql = """
select name, salary from
(select *, row_number() over (partition by name order by date desc) as rn
from tmp)
where rn = 1
"""
df = spark.sql(sql)
df.show(truncate=False)
- 首先将字符串转换为日期
- 将日期隐藏到UNixTimestamp中。(日期的数字表示,因此可以使用Max)
- 用户";第一个";作为一个集合函数,用于检索聚合结果的值。(第一个结果,所以如果有日期平局,它可能会拉任何一个。)
:
simpleData = [("James","Sales","NY",90000,34,'2022-02-01'),
("Michael","Sales","NY",86000,56,'2022-02-01'),
("Robert","Sales","CA",81000,30,'2022-02-01'),
("Maria","Finance","CA",90000,24,'2022-02-01'),
("Raman","Finance","CA",99000,40,'2022-03-01'),
("Scott","Finance","NY",83000,36,'2022-04-01'),
("Jen","Finance","NY",79000,53,'2022-04-01'),
("Jeff","Marketing","CA",80000,25,'2022-04-01'),
("Kumar","Marketing","NY",91000,50,'2022-05-01')
]
schema = ["employee_name","name","state","salary","age","updated"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.printSchema()
df.show(truncate=False)
df.withColumn(
"dateUpdated",
unix_timestamp(
to_date(
col("updated") ,
"yyyy-MM-dd"
)
)
).groupBy("name")
.agg(
max("dateUpdated"),
first("salary").alias("Salary")
).show()
+---------+----------------+------+
| name|max(dateUpdated)|Salary|
+---------+----------------+------+
| Sales| 1643691600| 90000|
| Finance| 1648785600| 90000|
|Marketing| 1651377600| 80000|
+---------+----------------+------+
我通常的技巧是"zip";date
和salary
在一起(取决于你想先排序什么)
from pyspark.sql import functions as F
(df
.groupBy('name')
.agg(F.max(F.array('date', 'salary')).alias('max_date_salary'))
.withColumn('max_salary', F.col('max_date_salary')[1])
.show()
)
+----+---------------+----------+
|name|max_date_salary|max_salary|
+----+---------------+----------+
| AA|[2022-02, 4500]| 4500|
| BB|[2022-02, 4000]| 4000|
+----+---------------+----------+