窗口分析功能在其中窗框和顺序是不同的字段



如何计算 end_date 之前的记录的平均记录持续时间此记录的 start_date?

我可以通过自我加入来做到这一点:

  SELECT AVG(p.duration) AS prior_duration
  FROM `bigquery-public-data`.london_bicycles.cycle_hire c
  JOIN `bigquery-public-data`.london_bicycles.cycle_hire p
  ON c.start_station_id = p.start_station_id AND
     p.end_date BETWEEN TIMESTAMP_SUB(c.start_date, INTERVAL 3600 SECOND)
                  AND c.start_date

但是,我该如何更有效地做到这一点(没有自加入(?沿着:

的线
AVG(duration)
         OVER(PARTITION BY start_station_id
         ORDER BY UNIX_SECONDS(end_date) ASC 
         RANGE BETWEEN 3600 PRECEDING AND CURRENT ROW) AS prior_duration

但是使用当前记录的start_date。

更新:请参阅Mikhail的评论。这行不通。我已经更新了查询,请避免进行快速优化。

这是一个精确的解决方案。这个想法是要在车站上构建所有记录的数组,并使用相关查询在过去一个小时内过滤。处理整个数据集需要7秒钟。

车站的记录阵列大小必须为<不过,100 MB。将您的数组足够小的尽可能多的字段组成:(

WITH all_hires AS (
  SELECT 
    start_station_id
    , ARRAY_AGG(STRUCT(duration, 
                       start_date, 
                       TIMESTAMP_SUB(start_date, INTERVAL 1 HOUR) AS start_date_m1h, 
                       end_date)) AS hires
  FROM `bigquery-public-data`.london_bicycles.cycle_hire
  GROUP BY start_station_id
),
hires_by_ts AS (
  SELECT
    start_station_id
    , h.start_date
    , (SELECT AVG(duration) FROM UNNEST(hires) 
       WHERE end_date BETWEEN h.start_date_m1h AND h.start_date)
         AS duration_prev_hour
    , (SELECT COUNT(duration) FROM UNNEST(hires) 
       WHERE end_date BETWEEN h.start_date_m1h AND h.start_date)
         AS numreturns_prev_hour
  FROM
    all_hires, UNNEST(hires) AS h
)
SELECT * from hires_by_ts
WHERE duration_prev_hour IS NOT NULL
ORDER BY duration_prev_hour DESC
LIMIT 5

鉴于您不能在订购和窗框边界中使用不同的字段 - 我想到的唯一方法是做两次,并使用警告您可能/但是,可能会错过一些行:

WITH cycle_hires AS (
  SELECT 
    start_station_id,
    start_date,
    ARRAY_AGG(STRUCT(end_date, duration)) OVER (
      PARTITION BY start_station_id
      ORDER BY end_date ASC
      ROWS BETWEEN 100 PRECEDING AND CURRENT ROW
    ) AS previous
  FROM `bigquery-public-data`.london_bicycles.cycle_hire AS c
)
SELECT
  c.start_station_id,
  AVG(p.duration) AS previous_duration,
  COUNT(*) AS number_of_previous_trips_used
FROM cycle_hires AS c
  JOIN UNNEST(previous) AS p
  WHERE p.end_date BETWEEN TIMESTAMP_SUB(c.start_date, INTERVAL 3600 SECOND) AND c.start_date
GROUP BY 1

使用此数据集(〜24m行(,最多使用100行的行将需要〜20s将其增加到1000行将需要〜120s。

最新更新