SQL/PySpark query to combine each value in a group with an aggregation over the same group



Suppose I have a table like this (the sample data is reproduced in the createDataFrame call below).

You can use a window function, partitioned by id and ordered by date.

from pyspark.sql import functions as F
from pyspark.sql import Window
# Recreate the sample data from the question
df = spark.createDataFrame(
[(1, '01/04', 7),
(1, '01/03', 2),
(1, '01/02', 10),
(1, '01/01', 4),
(2, '01/04', 7),
(2, '01/03', 7),
(2, '01/02', 8),
(2, '01/01', 5)],
['id', 'date', 'value']
)
df.createOrReplaceTempView('sample_data')

PySpark (DataFrame API)

# Frame covers the two rows immediately preceding the current row within each id
w = Window().partitionBy('id').orderBy('date').rowsBetween(-2, -1)
df = df.withColumn('avg', F.avg('value').over(w))
df.orderBy(F.col('id').asc(), F.col('date').desc()).show()
+---+-----+-----+----+
| id| date|value| avg|
+---+-----+-----+----+
|  1|01/04|    7| 6.0|
|  1|01/03|    2| 7.0|
|  1|01/02|   10| 4.0|
|  1|01/01|    4|null|
|  2|01/04|    7| 7.5|
|  2|01/03|    7| 6.5|
|  2|01/02|    8| 5.0|
|  2|01/01|    5|null|
+---+-----+-----+----+
SQL
spark.sql(
'''
SELECT
id,
date,
value,
AVG(value) OVER (PARTITION BY id ORDER BY date ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING) AS avg
FROM sample_data
ORDER BY id, date DESC
''').show()
+---+-----+-----+----+
| id| date|value| avg|
+---+-----+-----+----+
|  1|01/04|    7| 6.0|
|  1|01/03|    2| 7.0|
|  1|01/02|   10| 4.0|
|  1|01/01|    4|null|
|  2|01/04|    7| 7.5|
|  2|01/03|    7| 6.5|
|  2|01/02|    8| 5.0|
|  2|01/01|    5|null|
+---+-----+-----+----+
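
If what you actually want is a single group-wide aggregate attached to every row (as the title suggests) rather than a rolling one, a window partitioned by id with no ordering and no frame spans the whole partition. The following is a minimal sketch reusing the df, F and Window objects defined above; the group_avg column name is only illustrative.

from pyspark.sql import functions as F
from pyspark.sql import Window

# No orderBy and no frame: the window covers the entire id partition,
# so every row in a group gets the same group-wide average.
w_all = Window.partitionBy('id')
df.withColumn('group_avg', F.avg('value').over(w_all)) \
  .orderBy(F.col('id').asc(), F.col('date').desc()) \
  .show()
# For the sample data this gives group_avg = 5.75 for id 1 and 6.75 for id 2.

The SQL equivalent would simply drop the ORDER BY and ROWS clause from the OVER specification, i.e. AVG(value) OVER (PARTITION BY id).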
