在SQL中按组插入缺失日期的最佳方法



我有一个这样的数据集:

我们确实可以使用月份开始的日历表来解决这个问题,例如calendar(date)

我们可以在聚合子查询中定义每个帐户的日期范围,然后用日历表cross join它;这给了我们所有可能的日期/帐户元组。剩下要做的就是尝试将相应的行(如果有的话)带left join

select c.date, a.account, coalesce(t.spend, 0) spend
from (
select account, min(date) min_date, max(date) max_date 
from mytable
group by account
) a
inner join calendar c on c.date >= a.min_date and c.date <= a.max_date
left join mytable   t on t.date = c.date and t.account = a.account

步骤和方法:

  • 首先需要使用first_value函数查找每个组的最大和最小日期

  • 您需要一个包含唯一月份和所有唯一帐户的数据集-这就是cross join的由来。您可以相应地调整表的名称

  • 您需要将max_date和min_date值连接到每个帐户值,这需要过滤掉min_date超出实际数据集范围的行,就像您描述的日期2/1/21的帐户B的情况一样。要实现这一点,您只需要在键accounts = accounts

    上进行连接
  • 然后您需要根据日期和帐户(即m2)加入,以获得基于日期和帐户的消费值

  • 最后,您可以过滤掉在where子句

    中给定帐户的原始日期的最大值之后和最小值之前出现的行
with main as (
select 
date,
account,
coalesce(spend,0) as total_spend,
first_value(Date) over(partition by account order by date desc) as max_date,
first_value(Date) over(partition by account order by date) as min_date
from <table_name>
),
combining as (
-- make sure you have distinct accounts and months stored in these tables
select distinct accounts from <account_name_table>
cross join <calendar_month_table>
),
joining as (
select
c.date,
c.accounts,
coalesce(m2.total_spend,0) as new_spend,
main.max_date,
main.min_date
from combining
left join main
on combining.accounts = main.accounts
left join main as m2
on combining.accounts = m2.accounts
and combining.date = m2.date
)
select * from joining where min_date <= date and max_date >= date

最新更新