FIRST_VALUE和FULL OUTER JOIN用先前的记录填充空是不工作的



我们的Snowflake数据仓库中的源数据有" gap ";即失踪的日子,我的要求是填补缺失的数据,从前一行。因此,我创建了这个虚拟项目来练习使用完整的外连接和窗口函数FIRST_VALUE

我的示例数据有两个表:一个名为prev_test,一个名为dim_date
create TABLE prev_test (
SITE_ID VARCHAR(650),
SUBSCRIPTION_ID VARCHAR(650),
ORDER_CREATED DATE,
ORDER_TYPE VARCHAR(650),
SUBSCRIPTION_STATUS VARCHAR(650),
PERIOD_NORMALIZER VARCHAR(650),
CHANGE_MRR_EVENT_TYPE VARCHAR(650),
TOTAL integer,
DAILY_MRR integer
);
INSERT INTO prev_test 
VALUES('AB', '123', '2021-09-17', 'PRORATED', 'ACTIVE', '1M', 'New', 60, 2);
INSERT INTO prev_test 
VALUES('AB', '123', '2021-09-20', 'PRORATED', 'ACTIVE', '1M', 'New', 30, 10);
create TABLE dim_date (
date_key date
);
INSERT INTO dim_date 
VALUES('2021-09-17');
INSERT INTO dim_date 
VALUES('2021-09-18');
INSERT INTO dim_date 
VALUES('2021-09-19');
INSERT INTO dim_date 
VALUES('2021-09-20');

当我这样选择我的数据

SELECT 
CASE WHEN SITE_ID IS NULL THEN FIRST_VALUE(SITE_ID) OVER (ORDER BY ORDER_CREATED ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE SITE_ID END AS SITE_ID , 
CASE WHEN SUBSCRIPTION_ID IS NULL THEN FIRST_VALUE(SUBSCRIPTION_ID) OVER (ORDER BY ORDER_CREATED ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE SUBSCRIPTION_ID END AS SUBSCRIPTION_ID , 
CASE WHEN ORDER_CREATED IS NULL THEN dd.DATE_KEY ELSE ORDER_CREATED END AS ORDER_CREATED, 
CASE WHEN ORDER_TYPE IS NULL THEN FIRST_VALUE(ORDER_TYPE) OVER (ORDER BY ORDER_CREATED ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE ORDER_TYPE END AS ORDER_TYPE , 
CASE WHEN SUBSCRIPTION_STATUS IS NULL THEN FIRST_VALUE(SUBSCRIPTION_STATUS) OVER (ORDER BY ORDER_CREATED ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE SUBSCRIPTION_STATUS END AS SUBSCRIPTION_STATUS , 
CASE WHEN PERIOD_NORMALIZER IS NULL THEN FIRST_VALUE(PERIOD_NORMALIZER) OVER (ORDER BY ORDER_CREATED ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE PERIOD_NORMALIZER END AS PERIOD_NORMALIZER , 
CASE WHEN CHANGE_MRR_EVENT_TYPE IS NULL THEN FIRST_VALUE(CHANGE_MRR_EVENT_TYPE) OVER (ORDER BY ORDER_CREATED ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE CHANGE_MRR_EVENT_TYPE END AS CHANGE_MRR_EVENT_TYPE , 
CASE WHEN TOTAL IS NULL THEN FIRST_VALUE(TOTAL) OVER (ORDER BY ORDER_CREATED ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE TOTAL END AS TOTAL , 
CASE WHEN DAILY_MRR IS NULL THEN FIRST_VALUE(DAILY_MRR) OVER (ORDER BY ORDER_CREATED ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE DAILY_MRR END AS DAILY_MRR  
FROM prev_test
FULL OUTER JOIN DIM_DATE dd ON ORDER_CREATED = dd.DATE_KEY 
ORDER BY 3 asc

这给出了我想要的输出。

然而,一旦我将相同的代码应用到我的生产数据(与我的示例相同的结构),它就不能工作了。缺少的天数没有被填满,FIRST_VALUE没有"重复";数据转到以下记录。没有错误。

这是我的"real"数据


SITE_ID    |SUBSCRIPTION_ID       |ORDER_CREATED|ORDER_TYPE     |SUBSCRIPTION_STATUS|PERIOD_NORMALIZER|CHANGE_MRR_EVENT_TYPE  |TOTAL   |DAILY_MRR     |
L22|JriInfs|   2021-06-02|PRORATED|Active          |1M               |Upgraded From  |  0.0000|  0.000000|
L22|JriInfs|   2021-09-17|PRORATED|Active          |1M               |New             |209.0000|209.000000|
L22|JriInfs|   2021-09-30|PRORATED|Active          |1M               |Changed         |269.0000|269.000000|
L22|JriInfs|   2021-10-08|PRORATED|Active          |1M               |Downgraded From|  0.0000|  0.000000|

这是我想要的输出(**是缺少的",我希望这对你有帮助:


SITE_ID    |SUBSCRIPTION_ID       |ORDER_CREATED|ORDER_TYPE     |SUBSCRIPTION_STATUS|PERIOD_NORMALIZER|CHANGE_MRR_EVENT_TYPE  |TOTAL   |DAILY_MRR     |
L22|JriInfs|   2021-06-02|PRORATED|Active          |1M               |Upgraded From  |  0.0000|  0.000000|
**L22|JriInfs|   2021-06-03|PRORATED|Active          |1M               |Upgraded From  |  0.0000|  0.000000|**
…
**L22|JriInfs|   2021-09-16|PRORATED|Active          |1M               |Upgraded From  |  0.0000|  0.000000|**
L22|JriInfs|   2021-09-17|PRORATED|Active          |1M               |New             |209.0000|209.000000|
**L22|JriInfs|   2021-09-18|PRORATED|Active          |1M               |New             |209.0000|209.000000|**
…
**L22|JriInfs|   2021-09-29|PRORATED|Active          |1M               |New             |209.0000|209.000000|**
L22|JriInfs|   2021-09-30|PRORATED|Active          |1M               |Changed         |269.0000|269.000000|
...

我错过了什么?

您的查询的问题是您是基于ORDER_CREATED订购的。这里不使用转换后的值,相反,对于缺失的行,值是NULL,因此FIRST_VALUE的结果为NULL。你应该通过DATE_KEY订购。但是,通过DATE_KEY排序仍然会获得不正确的结果,因为缺失的行总是用第一行的值填充。在您的示例中,如果您有另一个插入

INSERT INTO prev_test 
VALUES('AB', '123', '2021-09-23', 'PRORATED', 'ACTIVE', '1M', 'New', 30, 10);

日期21和22的行将从日期17获取值(这是不正确的),而不是从日期20获取值。

我尝试了下面的查询,它工作。您可以将其修改为使用FIRST_VALUE

select
case when site_id is not null then site_id 
else max(site_id) over(partition by c_val order by order_created) end as site_id,
case when SUBSCRIPTION_ID is not null then SUBSCRIPTION_ID
else max(SUBSCRIPTION_ID) over(partition by c_val order by order_created) end as SUBSCRIPTION_ID,
order_created
from
(select 
*,
sum(val)over(order by order_created rows unbounded preceding) as c_val from
(
select
site_id,
subscription_id,
coalesce(a.order_created,dd.DATE_KEY) as order_created,
ORDER_TYPE,
SUBSCRIPTION_STATUS,
PERIOD_NORMALIZER,
CHANGE_MRR_EVENT_TYPE,
TOTAL,
DAILY_MRR,
case when a.order_created is null then 0 else 1 end as val
FROM prev_test a
FULL OUTER JOIN DIM_DATE dd ON a.ORDER_CREATED = dd.DATE_KEY
WHERE  dd.DATE_KEY >=(select min(a.ORDER_CREATED) from prev_test)
)tbl) x

db fiddle: http://sqlfiddle.com/#!18/c7dc8/3/0

编辑:增加了过滤日期起始点的条件。

这就是我的问题的解决方案


SELECT DISTINCT
CASE 
WHEN srm.SITE_ID IS NULL THEN FIRST_VALUE(srm.SITE_ID) OVER (ORDER BY dd.DATE_KEY ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE srm.SITE_ID END AS SITE_ID, 
CASE WHEN srm.SUBSCRIPTION_ID IS NULL THEN FIRST_VALUE(SUBSCRIPTION_ID) OVER (ORDER BY  dd.DATE_KEY ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE srm.SUBSCRIPTION_ID END AS SUBSCRIPTION_ID , 
DD.DATE_KEY AS ORDER_CREATED,
CASE WHEN srm.ORDER_TYPE IS NULL THEN FIRST_VALUE(srm.ORDER_TYPE) OVER (ORDER BY  dd.DATE_KEY ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE srm.ORDER_TYPE END AS ORDER_TYPE , 
CASE WHEN srm.SUBSCRIPTION_STATUS IS NULL THEN FIRST_VALUE(SUBSCRIPTION_STATUS) OVER (ORDER BY  dd.DATE_KEY ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE srm.SUBSCRIPTION_STATUS END AS SUBSCRIPTION_STATUS , 
CASE WHEN srm.PERIOD_NORMALIZER IS NULL THEN FIRST_VALUE(PERIOD_NORMALIZER) OVER (ORDER BY  dd.DATE_KEY ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE srm.PERIOD_NORMALIZER END AS PERIOD_NORMALIZER , 
CASE WHEN srm.CHANGE_MRR_EVENT_TYPE IS NULL THEN FIRST_VALUE(CHANGE_MRR_EVENT_TYPE) OVER (ORDER BY  dd.DATE_KEY ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE srm.CHANGE_MRR_EVENT_TYPE END AS CHANGE_MRR_EVENT_TYPE , 
CASE WHEN srm.TOTAL IS NULL THEN FIRST_VALUE(TOTAL) OVER (ORDER BY  dd.DATE_KEY ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE srm.TOTAL END AS TOTAL , 
CASE WHEN srm.DAILY_MRR IS NULL THEN FIRST_VALUE(DAILY_MRR) OVER (ORDER BY  dd.DATE_KEY ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE srm.DAILY_MRR END AS DAILY_MRR  
FROM DIM_DATE dd
LEFT JOIN SUBSCRIPTION_REVENUE_MART srm 
ON  srm.ORDER_CREATED = dd.DATE_KEY
ORDER BY dd.DATE_KEY;

变化如下:

  • 连接从FULL变为LEFT
  • 表的顺序被颠倒了
  • 在SELECT子句
  • 中提取DATE_KEY,而不是提取ORDER_CREATED。

相关内容

  • 没有找到相关文章

最新更新