假设我有一个表,其中有一列为spend_date_id。为了简单起见,该列表示为自第一次花费开始以来的天数,所以我们有一个整数列。
样本数据:
spend_date_id
0
350
450
500
550
650
700 and so on
我需要将这些日期与一个活动窗口相关联,其中每个活动窗口的长度为从最早消费日期起200天。每次活动窗口结束时,下一个可用的消费日期都会开始一个新窗口。
因此,在我们的例子中,这看起来像
spend_date_id active_window
0 0
350 1
450 1
500 1
550 2
650 2
700 2 and so on
窗口0:从0开始一直到199。此窗口中的所有日期都将具有此窗口编号,即date_ids 0。
窗口1:从350开始,到549。此窗口中的所有日期都将具有此窗口编号,即date_ids 350、450、500。
请注意,我可以使用的SQL版本BigQuery不允许递归CTE。
下面是BigQuery标准SQL
#standardSQL
CREATE TEMP FUNCTION splits(arr ARRAY<INT64>, size INT64)
RETURNS ARRAY<INT64>
LANGUAGE js AS """
var cut = parseInt(arr[0]) + parseInt(size);
var result = [arr[0]];
for(i=1; i<arr.length; i++){
if (arr[i] >= cut) {
result.push(arr[i]);
cut = parseInt(arr[i]) + parseInt(size);
}
};
return result;
""";
WITH dirty_split AS (
SELECT spend_date_id, COUNTIF(flag) OVER(ORDER BY spend_date_id) new_big_window
FROM (
SELECT spend_date_id, spend_date_id - LAG(spend_date_id) OVER(ORDER BY spend_date_id) > 200 flag
FROM `project.dataset.table`
)
), precise_split AS (
SELECT new_big_window, splits(ARRAY_AGG(spend_date_id), 200) splits
FROM dirty_split
GROUP BY new_big_window
), numbering AS (
SELECT new_big_window, val, ROW_NUMBER() OVER(ORDER BY new_big_window, val) - 1 AS active_window
FROM precise_split, UNNEST(splits) val
)
SELECT spend_date_id, active_window
FROM `project.dataset.table` t
JOIN numbering n
ON t.spend_date_id BETWEEN n.val AND n.val + 200 - 1
-- ORDER BY spend_date_id
如果适用于您的问题中的样本数据,如下面的示例
#standardSQL
CREATE TEMP FUNCTION splits(arr ARRAY<INT64>, size INT64)
RETURNS ARRAY<INT64>
LANGUAGE js AS """
var cut = parseInt(arr[0]) + parseInt(size);
var result = [arr[0]];
for(i=1; i<arr.length; i++){
if (arr[i] >= cut) {
result.push(arr[i]);
cut = parseInt(arr[i]) + parseInt(size);
}
};
return result;
""";
WITH `project.dataset.table` AS (
SELECT * FROM UNNEST([0,350,450,500,550,650,700]) AS spend_date_id
), dirty_split AS (
SELECT spend_date_id, COUNTIF(flag) OVER(ORDER BY spend_date_id) new_big_window
FROM (
SELECT spend_date_id, spend_date_id - LAG(spend_date_id) OVER(ORDER BY spend_date_id) > 200 flag
FROM `project.dataset.table`
)
), precise_split AS (
SELECT new_big_window, splits(ARRAY_AGG(spend_date_id), 200) splits
FROM dirty_split
GROUP BY new_big_window
), numbering AS (
SELECT new_big_window, val, ROW_NUMBER() OVER(ORDER BY new_big_window, val) - 1 AS active_window
FROM precise_split, UNNEST(splits) val
)
SELECT spend_date_id, active_window
FROM `project.dataset.table` t
JOIN numbering n
ON t.spend_date_id BETWEEN n.val AND n.val + 200 - 1
ORDER BY spend_date_id
结果是
Row spend_date_id active_window
1 0 0
2 350 1
3 450 1
4 500 1
5 550 2
6 650 2
7 700 2
显然,以上内容取决于表中驱动要处理的数组大小的行数。为了帮助解决这个问题,你可以看到,首先我做dirty split
,我只是将彼此间隔超过200的组分开,因此没有机会在同一个活动窗口中。因此,这可能会降低最终precise split
的阵列尺寸
您也可以通过BigQuery:中的脚本功能来解决此问题
set delta = 199;
set t = array(select spend_date_id from dataset.table);
set buckets = '';
set n = array_length(t);
set i = 0;
set k = 0;
set startw = t[offset(k)];
set endw = t[offset(k)]+delta;
loop
if i >= n then leave;
else
set item = t[offset(i)];
if item between startw and endw then
set grp = k+1;
set buckets = if (
buckets = "",
cast(grp as string),
concat(buckets, ",", cast(grp as string))
);
else
set k = k+1;
set startw = t[offset(i)];
set endw = t[offset(i)] + delta;
set grp = k;
set buckets = if (
buckets = "",
cast(grp as string),
concat(buckets, ",", cast(grp as string))
);
end if;
set i = i+1;
end if;
end loop;
select y as active_window from unnest(split(buckets, ",")) as y