如何在SQL窗口上获得组中的第一个元素?



我想获得组的第一个元素,但是必须为每个窗口计算组。我想这样做:

模式

tbody> <<tr>
table
id
capture_date
event_date

这是一个全SQL版本,但如果需要,可以在spark/pyspark中重写。我使用的是groupby,但您也可以使用row_number &其中

with raw_averages as ( -- short cut for subquery
SELECT
AVG(
target
) OVER (
PARTITION BY groups
ORDER BY event_date
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) as average, 
ID, 
capture_date,
event_date
FROM table ),
grouped_result as -- more shortcut for subquery
(SELECT 
id, 
avg(average) as average, -- the average of the entire group is the same -->math trick
reverse( -- sort descending
array_sort( --sort ascending by first item ( event_date )
arrays_zip( -- create one array of below arrays
collect_list( event_date ), -- collect the grouped items *has to be first to get the ordering you want*
collect_list( capture_date ) -- collect the grouped items
)
)
)[0] as values --getting first will return max (reference first item in array)
from raw_averages 
GROUP BY groups, id 
-- HAVING values.`0` = values.`1` -- having might work here but I didn't explore it
)
select 
groups, 
id, 
average,
values.`0` as event_date -- awkward syntax because of arrays_zip
values.`1` as capture_date
from 
grouped_result
where values.`0` = values.`1`

相关内容

  • 没有找到相关文章

最新更新