我们的Snowflake数据仓库中的源数据有" gap ";即失踪的日子,我的要求是填补缺失的数据,从前一行。因此,我创建了这个虚拟项目来练习使用完整的外连接和窗口函数FIRST_VALUE
。
create TABLE prev_test (
SITE_ID VARCHAR(650),
SUBSCRIPTION_ID VARCHAR(650),
ORDER_CREATED DATE,
ORDER_TYPE VARCHAR(650),
SUBSCRIPTION_STATUS VARCHAR(650),
PERIOD_NORMALIZER VARCHAR(650),
CHANGE_MRR_EVENT_TYPE VARCHAR(650),
TOTAL integer,
DAILY_MRR integer
);
INSERT INTO prev_test
VALUES('AB', '123', '2021-09-17', 'PRORATED', 'ACTIVE', '1M', 'New', 60, 2);
INSERT INTO prev_test
VALUES('AB', '123', '2021-09-20', 'PRORATED', 'ACTIVE', '1M', 'New', 30, 10);
create TABLE dim_date (
date_key date
);
INSERT INTO dim_date
VALUES('2021-09-17');
INSERT INTO dim_date
VALUES('2021-09-18');
INSERT INTO dim_date
VALUES('2021-09-19');
INSERT INTO dim_date
VALUES('2021-09-20');
当我这样选择我的数据
SELECT
CASE WHEN SITE_ID IS NULL THEN FIRST_VALUE(SITE_ID) OVER (ORDER BY ORDER_CREATED ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE SITE_ID END AS SITE_ID ,
CASE WHEN SUBSCRIPTION_ID IS NULL THEN FIRST_VALUE(SUBSCRIPTION_ID) OVER (ORDER BY ORDER_CREATED ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE SUBSCRIPTION_ID END AS SUBSCRIPTION_ID ,
CASE WHEN ORDER_CREATED IS NULL THEN dd.DATE_KEY ELSE ORDER_CREATED END AS ORDER_CREATED,
CASE WHEN ORDER_TYPE IS NULL THEN FIRST_VALUE(ORDER_TYPE) OVER (ORDER BY ORDER_CREATED ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE ORDER_TYPE END AS ORDER_TYPE ,
CASE WHEN SUBSCRIPTION_STATUS IS NULL THEN FIRST_VALUE(SUBSCRIPTION_STATUS) OVER (ORDER BY ORDER_CREATED ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE SUBSCRIPTION_STATUS END AS SUBSCRIPTION_STATUS ,
CASE WHEN PERIOD_NORMALIZER IS NULL THEN FIRST_VALUE(PERIOD_NORMALIZER) OVER (ORDER BY ORDER_CREATED ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE PERIOD_NORMALIZER END AS PERIOD_NORMALIZER ,
CASE WHEN CHANGE_MRR_EVENT_TYPE IS NULL THEN FIRST_VALUE(CHANGE_MRR_EVENT_TYPE) OVER (ORDER BY ORDER_CREATED ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE CHANGE_MRR_EVENT_TYPE END AS CHANGE_MRR_EVENT_TYPE ,
CASE WHEN TOTAL IS NULL THEN FIRST_VALUE(TOTAL) OVER (ORDER BY ORDER_CREATED ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE TOTAL END AS TOTAL ,
CASE WHEN DAILY_MRR IS NULL THEN FIRST_VALUE(DAILY_MRR) OVER (ORDER BY ORDER_CREATED ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE DAILY_MRR END AS DAILY_MRR
FROM prev_test
FULL OUTER JOIN DIM_DATE dd ON ORDER_CREATED = dd.DATE_KEY
ORDER BY 3 asc
这给出了我想要的输出。
然而,一旦我将相同的代码应用到我的生产数据(与我的示例相同的结构),它就不能工作了。缺少的天数没有被填满,FIRST_VALUE没有"重复";数据转到以下记录。没有错误。
这是我的"real"数据
SITE_ID |SUBSCRIPTION_ID |ORDER_CREATED|ORDER_TYPE |SUBSCRIPTION_STATUS|PERIOD_NORMALIZER|CHANGE_MRR_EVENT_TYPE |TOTAL |DAILY_MRR |
L22|JriInfs| 2021-06-02|PRORATED|Active |1M |Upgraded From | 0.0000| 0.000000|
L22|JriInfs| 2021-09-17|PRORATED|Active |1M |New |209.0000|209.000000|
L22|JriInfs| 2021-09-30|PRORATED|Active |1M |Changed |269.0000|269.000000|
L22|JriInfs| 2021-10-08|PRORATED|Active |1M |Downgraded From| 0.0000| 0.000000|
这是我想要的输出(**是缺少的",我希望这对你有帮助:
SITE_ID |SUBSCRIPTION_ID |ORDER_CREATED|ORDER_TYPE |SUBSCRIPTION_STATUS|PERIOD_NORMALIZER|CHANGE_MRR_EVENT_TYPE |TOTAL |DAILY_MRR |
L22|JriInfs| 2021-06-02|PRORATED|Active |1M |Upgraded From | 0.0000| 0.000000|
**L22|JriInfs| 2021-06-03|PRORATED|Active |1M |Upgraded From | 0.0000| 0.000000|**
…
**L22|JriInfs| 2021-09-16|PRORATED|Active |1M |Upgraded From | 0.0000| 0.000000|**
L22|JriInfs| 2021-09-17|PRORATED|Active |1M |New |209.0000|209.000000|
**L22|JriInfs| 2021-09-18|PRORATED|Active |1M |New |209.0000|209.000000|**
…
**L22|JriInfs| 2021-09-29|PRORATED|Active |1M |New |209.0000|209.000000|**
L22|JriInfs| 2021-09-30|PRORATED|Active |1M |Changed |269.0000|269.000000|
...
我错过了什么?
您的查询的问题是您是基于ORDER_CREATED
订购的。这里不使用转换后的值,相反,对于缺失的行,值是NULL
,因此FIRST_VALUE
的结果为NULL。你应该通过DATE_KEY
订购。但是,通过DATE_KEY
排序仍然会获得不正确的结果,因为缺失的行总是用第一行的值填充。在您的示例中,如果您有另一个插入
INSERT INTO prev_test
VALUES('AB', '123', '2021-09-23', 'PRORATED', 'ACTIVE', '1M', 'New', 30, 10);
日期21和22的行将从日期17获取值(这是不正确的),而不是从日期20获取值。
我尝试了下面的查询,它工作。您可以将其修改为使用FIRST_VALUE
select
case when site_id is not null then site_id
else max(site_id) over(partition by c_val order by order_created) end as site_id,
case when SUBSCRIPTION_ID is not null then SUBSCRIPTION_ID
else max(SUBSCRIPTION_ID) over(partition by c_val order by order_created) end as SUBSCRIPTION_ID,
order_created
from
(select
*,
sum(val)over(order by order_created rows unbounded preceding) as c_val from
(
select
site_id,
subscription_id,
coalesce(a.order_created,dd.DATE_KEY) as order_created,
ORDER_TYPE,
SUBSCRIPTION_STATUS,
PERIOD_NORMALIZER,
CHANGE_MRR_EVENT_TYPE,
TOTAL,
DAILY_MRR,
case when a.order_created is null then 0 else 1 end as val
FROM prev_test a
FULL OUTER JOIN DIM_DATE dd ON a.ORDER_CREATED = dd.DATE_KEY
WHERE dd.DATE_KEY >=(select min(a.ORDER_CREATED) from prev_test)
)tbl) x
db fiddle: http://sqlfiddle.com/#!18/c7dc8/3/0
编辑:增加了过滤日期起始点的条件。
这就是我的问题的解决方案
SELECT DISTINCT
CASE
WHEN srm.SITE_ID IS NULL THEN FIRST_VALUE(srm.SITE_ID) OVER (ORDER BY dd.DATE_KEY ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE srm.SITE_ID END AS SITE_ID,
CASE WHEN srm.SUBSCRIPTION_ID IS NULL THEN FIRST_VALUE(SUBSCRIPTION_ID) OVER (ORDER BY dd.DATE_KEY ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE srm.SUBSCRIPTION_ID END AS SUBSCRIPTION_ID ,
DD.DATE_KEY AS ORDER_CREATED,
CASE WHEN srm.ORDER_TYPE IS NULL THEN FIRST_VALUE(srm.ORDER_TYPE) OVER (ORDER BY dd.DATE_KEY ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE srm.ORDER_TYPE END AS ORDER_TYPE ,
CASE WHEN srm.SUBSCRIPTION_STATUS IS NULL THEN FIRST_VALUE(SUBSCRIPTION_STATUS) OVER (ORDER BY dd.DATE_KEY ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE srm.SUBSCRIPTION_STATUS END AS SUBSCRIPTION_STATUS ,
CASE WHEN srm.PERIOD_NORMALIZER IS NULL THEN FIRST_VALUE(PERIOD_NORMALIZER) OVER (ORDER BY dd.DATE_KEY ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE srm.PERIOD_NORMALIZER END AS PERIOD_NORMALIZER ,
CASE WHEN srm.CHANGE_MRR_EVENT_TYPE IS NULL THEN FIRST_VALUE(CHANGE_MRR_EVENT_TYPE) OVER (ORDER BY dd.DATE_KEY ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE srm.CHANGE_MRR_EVENT_TYPE END AS CHANGE_MRR_EVENT_TYPE ,
CASE WHEN srm.TOTAL IS NULL THEN FIRST_VALUE(TOTAL) OVER (ORDER BY dd.DATE_KEY ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE srm.TOTAL END AS TOTAL ,
CASE WHEN srm.DAILY_MRR IS NULL THEN FIRST_VALUE(DAILY_MRR) OVER (ORDER BY dd.DATE_KEY ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE srm.DAILY_MRR END AS DAILY_MRR
FROM DIM_DATE dd
LEFT JOIN SUBSCRIPTION_REVENUE_MART srm
ON srm.ORDER_CREATED = dd.DATE_KEY
ORDER BY dd.DATE_KEY;
变化如下:
- 连接从FULL变为LEFT
- 表的顺序被颠倒了
- 在SELECT子句 中提取DATE_KEY,而不是提取ORDER_CREATED。