在SQL中捕获200天的花费窗口开始日期



假设我有一个表,其中有一列为spend_date_id。为了简单起见,该列表示为自第一次花费开始以来的天数,所以我们有一个整数列。

样本数据:

spend_date_id
0
350
450
500
550
650
700 and so on

我需要将这些日期与一个活动窗口相关联,其中每个活动窗口的长度为从最早消费日期起200天。每次活动窗口结束时,下一个可用的消费日期都会开始一个新窗口。

因此,在我们的例子中,这看起来像

spend_date_id    active_window 
0                0
350              1
450              1
500              1
550              2
650              2
700              2 and so on

窗口0:从0开始一直到199。此窗口中的所有日期都将具有此窗口编号,即date_ids 0。

窗口1:从350开始,到549。此窗口中的所有日期都将具有此窗口编号,即date_ids 350、450、500。

请注意,我可以使用的SQL版本BigQuery不允许递归CTE。

下面是BigQuery标准SQL

#standardSQL
CREATE TEMP FUNCTION splits(arr ARRAY<INT64>, size INT64)
RETURNS ARRAY<INT64>
LANGUAGE js AS """
var cut = parseInt(arr[0]) + parseInt(size);
var result = [arr[0]];
for(i=1; i<arr.length; i++){
if (arr[i] >= cut) {
result.push(arr[i]); 
cut = parseInt(arr[i]) + parseInt(size);  
}
};
return result;
""";
WITH dirty_split AS (
SELECT spend_date_id, COUNTIF(flag) OVER(ORDER BY spend_date_id) new_big_window
FROM (
SELECT spend_date_id, spend_date_id - LAG(spend_date_id) OVER(ORDER BY spend_date_id) > 200 flag
FROM `project.dataset.table`
)
), precise_split AS (
SELECT new_big_window, splits(ARRAY_AGG(spend_date_id), 200) splits
FROM dirty_split
GROUP BY new_big_window
), numbering AS (
SELECT new_big_window, val, ROW_NUMBER() OVER(ORDER BY new_big_window, val) - 1 AS active_window
FROM precise_split, UNNEST(splits) val
)
SELECT spend_date_id, active_window
FROM `project.dataset.table` t
JOIN numbering n
ON t.spend_date_id BETWEEN n.val AND n.val + 200 - 1
-- ORDER BY spend_date_id

如果适用于您的问题中的样本数据,如下面的示例

#standardSQL
CREATE TEMP FUNCTION splits(arr ARRAY<INT64>, size INT64)
RETURNS ARRAY<INT64>
LANGUAGE js AS """
var cut = parseInt(arr[0]) + parseInt(size);
var result = [arr[0]];
for(i=1; i<arr.length; i++){
if (arr[i] >= cut) {
result.push(arr[i]); 
cut = parseInt(arr[i]) + parseInt(size);  
}
};
return result;
""";
WITH `project.dataset.table` AS (
SELECT * FROM UNNEST([0,350,450,500,550,650,700]) AS spend_date_id
), dirty_split AS (
SELECT spend_date_id, COUNTIF(flag) OVER(ORDER BY spend_date_id) new_big_window
FROM (
SELECT spend_date_id, spend_date_id - LAG(spend_date_id) OVER(ORDER BY spend_date_id) > 200 flag
FROM `project.dataset.table`
)
), precise_split AS (
SELECT new_big_window, splits(ARRAY_AGG(spend_date_id), 200) splits
FROM dirty_split
GROUP BY new_big_window
), numbering AS (
SELECT new_big_window, val, ROW_NUMBER() OVER(ORDER BY new_big_window, val) - 1 AS active_window
FROM precise_split, UNNEST(splits) val
)
SELECT spend_date_id, active_window
FROM `project.dataset.table` t
JOIN numbering n
ON t.spend_date_id BETWEEN n.val AND n.val + 200 - 1
ORDER BY spend_date_id

结果是

Row spend_date_id   active_window    
1   0               0    
2   350             1    
3   450             1    
4   500             1    
5   550             2    
6   650             2    
7   700             2

显然,以上内容取决于表中驱动要处理的数组大小的行数。为了帮助解决这个问题,你可以看到,首先我做dirty split,我只是将彼此间隔超过200的组分开,因此没有机会在同一个活动窗口中。因此,这可能会降低最终precise split的阵列尺寸

您也可以通过BigQuery:中的脚本功能来解决此问题

set delta = 199;
set t = array(select spend_date_id from dataset.table);
set buckets = '';
set n = array_length(t);
set i = 0;
set k = 0;
set startw = t[offset(k)];
set endw = t[offset(k)]+delta;
loop 
if i >= n then leave;
else
set item = t[offset(i)];
if item between startw and endw then 
set grp = k+1;
set buckets = if (
buckets = "", 
cast(grp as string), 
concat(buckets, ",", cast(grp as string))
);
else
set k = k+1;
set startw = t[offset(i)];
set endw = t[offset(i)] + delta;
set grp = k;
set buckets = if (
buckets = "", 
cast(grp as string), 
concat(buckets, ",", cast(grp as string))
);
end if;
set i = i+1;
end if;
end loop;
select y as active_window from unnest(split(buckets, ",")) as y

最新更新