Fill gaps in data with values proportional to the gap distance from the surrounding rows' data?



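Soon I will have to prepare a list of item prices by date. The granularity is 1 day; on days when an item has sales, I will average the prices to get that day's average. Some days will have no sales, and I'm happy that an adequate approximation is to take the previous and next sale and, for each day between them, use a price that moves linearly from one to the other.

Suppose the raw data is: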

Item   Date       Price
Bread  2000-01-01 10
Bread  2000-01-02 9.5
Bread  2000-01-04 9.1
Sugar  2000-01-01 100
Sugar  2000-01-11 150

I can get to here:

Item   Date       Price
Bread  2000-01-01 10
Bread  2000-01-02 9.5
Bread  2000-01-03 NULL
Bread  2000-01-04 9.1
Sugar  2000-01-01 100
Sugar  2000-01-02 NULL
Sugar  2000-01-03 NULL
Sugar  2000-01-04 NULL
Sugar  2000-01-05 NULL
Sugar  2000-01-06 NULL
Sugar  2000-01-07 NULL
Sugar  2000-01-08 NULL
Sugar  2000-01-09 NULL
Sugar  2000-01-10 NULL
Sugar  2000-01-11 150

Where I want to get to is:

Item   Date       Price
Bread  2000-01-01 10
Bread  2000-01-02 9.5
Bread  2000-01-03 9.3 --being 9.5 + ((9.1 - 9.5) / 2) * 1
Bread  2000-01-04 9.1
Sugar  2000-01-01 100
Sugar  2000-01-02 105 --being 100 + ((150 - 100) / 10) * 1
Sugar  2000-01-03 110 --being 100 + ((150 - 100) / 10) * 2
Sugar  2000-01-04 115
Sugar  2000-01-05 120
Sugar  2000-01-06 125
Sugar  2000-01-07 130
Sugar  2000-01-08 135
Sugar  2000-01-09 140
Sugar  2000-01-10 145 --being 100 + ((150 - 100) / 10) * 9
Sugar  2000-01-11 150

What have I tried so far? Only thinking; I plan to do something like this:

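  • Pull the raw data
  • Join to a numbers/calendar table to fill out the sparse data
  • LAST_VALUE() (or FIRST_VALUE?) OVER ROWS UNBOUNDED PRECEDING/FOLLOWING (with a nulls-last ORDER BY clause) to get the first non-null preceding_date, following_date, preceding_price and following_price from the raw data
  • DATEDIFF the fake date and the preceding_date to get a number of days (this is effectively how far across the gap we are, gap_progress) and the gap distance (following_date - preceding_date)
  • Feed the next price, preceding price and gap distance into the formula (preceding_price + ((next_price - preceding_price) / gap_distance) * gap_progress)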

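However, I'm wondering whether there is a simpler way, because I have millions of item-days and it doesn't feel like this will be very efficient.

I find lots of examples of questions where the data from the last or next row is smeared verbatim to fill in the gaps, but I don't recall seeing a case where some kind of transition is attempted. Perhaps that technique can be applied twice: a smear that runs forwards, copying the most recent value, alongside a smear that runs backwards: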

Item   Date       DateFwd    DateBak     PriceF PriceB
Bread  2000-01-01 2000-01-01 2000-01-01  10     10
Bread  2000-01-02 2000-01-02 2000-01-02  9.5    9.5
Bread  2000-01-03 2000-01-02 2000-01-04  9.5    9.1
Bread  2000-01-04 2000-01-04 2000-01-04  9.1    9.1
Sugar  2000-01-01 2000-01-01 2000-01-01  100    100
Sugar  2000-01-02 2000-01-01 2000-01-11  100    150
Sugar  2000-01-03 2000-01-01 2000-01-11  100    150
Sugar  2000-01-04 2000-01-01 2000-01-11  100    150
Sugar  2000-01-05 2000-01-01 2000-01-11  100    150
Sugar  2000-01-06 2000-01-01 2000-01-11  100    150
Sugar  2000-01-07 2000-01-01 2000-01-11  100    150
Sugar  2000-01-08 2000-01-01 2000-01-11  100    150
Sugar  2000-01-09 2000-01-01 2000-01-11  100    150
Sugar  2000-01-10 2000-01-01 2000-01-11  100    150
Sugar  2000-01-11 2000-01-11 2000-01-11  150    150

These might provide the necessary data for the formula (preceding_price + ((next_price - preceding_price) / gap_distance) * gap_progress):

  • gap_distance = DATEDIFF(day, DateFwd, DateBak)
  • gap_progress = DATEDIFF(day, DateFwd, Date)
  • next_price = PriceB
  • preceding_price = PriceF
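
A minimal sketch of the double smear described above, assuming SQL Server 2012 or later. The @sparse table variable, the smeared CTE and the column names are hypothetical stand-ins for the raw data already joined to a calendar. Because LAST_VALUE does not skip NULLs by default, this sketch swaps in MAX/MIN over a CASE expression to find DateFwd and DateBak, then joins back to pick up the two prices:

```sql
-- Hypothetical sparse input: one row per item-day, NULL price on days without sales.
DECLARE @sparse TABLE (Item varchar(5), [Date] date, Price decimal(10,5));
INSERT @sparse VALUES
('Bread', '2000-01-01', 10),
('Bread', '2000-01-02', 9.5),
('Bread', '2000-01-03', NULL),
('Bread', '2000-01-04', 9.1);

WITH smeared AS
(
    SELECT Item, [Date], Price,
           -- forward smear: latest date at or before this row that has a price
           DateFwd = MAX(CASE WHEN Price IS NOT NULL THEN [Date] END)
                     OVER (PARTITION BY Item ORDER BY [Date]
                           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
           -- backward smear: earliest date at or after this row that has a price
           DateBak = MIN(CASE WHEN Price IS NOT NULL THEN [Date] END)
                     OVER (PARTITION BY Item ORDER BY [Date]
                           ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
    FROM @sparse
)
SELECT s.Item, s.[Date],
       Price = CASE WHEN s.Price IS NOT NULL THEN s.Price
                    ELSE f.Price + (b.Price - f.Price)
                         -- NULLIF guards against a zero-length gap
                         / NULLIF(DATEDIFF(day, s.DateFwd, s.DateBak), 0)  -- gap_distance
                         * DATEDIFF(day, s.DateFwd, s.[Date])              -- gap_progress
               END
FROM smeared s
LEFT JOIN @sparse f ON f.Item = s.Item AND f.[Date] = s.DateFwd  -- preceding_price
LEFT JOIN @sparse b ON b.Item = s.Item AND b.[Date] = s.DateBak  -- next_price
ORDER BY s.Item, s.[Date];
```

For the 2000-01-03 Bread row this should work out to 9.5 + ((9.1 - 9.5) / 2) * 1 = 9.3. Days before an item's first sale or after its last one have no neighbour on one side, so their price stays NULL in this sketch.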

Here's the DDL for the data I know I have access to (the raw data, joined with a calendar table):

CREATE TABLE Data
([I] varchar(5), [D] date, [P] DECIMAL(10,5))
;
INSERT Data
([I], [D], [P])
VALUES
('Bread', '2000-01-01', 10),
('Bread', '2000-01-02', 9.5),
('Bread', '2000-01-04', 9.1),
('Sugar', '2000-01-01', 100),
('Sugar', '2000-01-11', 150);
CREATE TABLE Cal([D] DATE);
INSERT Cal VALUES
('2000-01-01'),
('2000-01-02'),
('2000-01-03'),
('2000-01-04'),
('2000-01-05'),
('2000-01-06'),
('2000-01-07'),
('2000-01-08'),
('2000-01-09'),
('2000-01-10'),
('2000-01-11');
SELECT d.i as [item], c.d as [date], d.p as [price] FROM
cal c LEFT JOIN data d ON c.d = d.d

You can use OUTER APPLY to get the previous and next rows whose price is not null:

select
d.item,
d.date,
case when d.price is null then
prev.price + ( (next.price - prev.price) /
datediff(day, prev.date, next.date) *
datediff(day, prev.date, d.date)
)
else
d.price
end as price
from data d
outer apply
(
select top(1) *
from data d2
where d2.item = d.item and d2.date < d.date and d2.price is not null
order by d2.date desc
) prev
outer apply
(
select top(1) *
from data d2
where d2.item = d.item and d2.date > d.date and d2.price is not null
order by d2.date
) next;

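Rextester demo: http://rextester.com/QBL7472

Update: This may be slow. Adding and d.price is null to the WHERE clause of the subqueries may help, as it shows the DBMS that it doesn't actually have to look up other records when the price is not null. Just check the execution plan to see whether it helps.

It is easier to generate the missing gap rows and their prices in one go.

So I start with your raw data: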
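
As a rough illustration of that update, here is the same OUTER APPLY query with the extra predicate added to both subqueries. The @data table variable and its sample rows are assumptions made only so the sketch runs on its own, and the second alias is spelled nxt purely for illustration; whether the predicate actually speeds anything up depends on the plan the optimizer picks:

```sql
-- Hypothetical sample: item-days already expanded, NULL price where there was no sale.
DECLARE @data TABLE (item varchar(5), [date] date, price decimal(10,5));
INSERT @data VALUES
('Bread', '2000-01-01', 10),
('Bread', '2000-01-02', 9.5),
('Bread', '2000-01-03', NULL),
('Bread', '2000-01-04', 9.1);

SELECT d.item,
       d.[date],
       CASE WHEN d.price IS NULL THEN
            prev.price + (nxt.price - prev.price)
                         / DATEDIFF(day, prev.[date], nxt.[date])
                         * DATEDIFF(day, prev.[date], d.[date])
       ELSE d.price
       END AS price
FROM @data d
OUTER APPLY
(
    SELECT TOP (1) d2.[date], d2.price
    FROM @data d2
    WHERE d.price IS NULL            -- only search when the outer row needs filling
      AND d2.item = d.item
      AND d2.[date] < d.[date]
      AND d2.price IS NOT NULL
    ORDER BY d2.[date] DESC
) prev
OUTER APPLY
(
    SELECT TOP (1) d2.[date], d2.price
    FROM @data d2
    WHERE d.price IS NULL            -- same guard for the following row
      AND d2.item = d.item
      AND d2.[date] > d.[date]
      AND d2.price IS NOT NULL
    ORDER BY d2.[date]
) nxt
ORDER BY d.item, d.[date];
```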

CREATE TABLE t
([I] varchar(5), [D] date, [P] DECIMAL(10,2))
;
INSERT INTO t
([I], [D], [P])
VALUES
('Bread', '2000-01-01 00:00:00', '10'),
('Bread', '2000-01-02 00:00:00', '9.5'),
('Bread', '2000-01-04 00:00:00', '9.1'),
('Sugar', '2000-01-01 00:00:00', '100'),
('Sugar', '2000-01-11 00:00:00', '150');
; with
-- number is a tally table. here i use recursive cte to generate 100 numbers
number as
(
select  n = 0
union all
select  n = n + 1
from    number
where   n < 99
),
-- a cte to get the Price of next date and also day diff
cte as
(
select  *, 
nextP = lead(P) over(partition by I order by D),
cnt = datediff(day, D, lead(D) over(partition by I order by D)) - 1
from    t
) 
select  I, 
D = dateadd(day, n, D), 
P = coalesce(c.P + (c.nextP - c.P) / ( cnt + 1) * n, c.P)
from    cte c
cross join number n
where   n.n <= isnull(c.cnt, 0)
drop table t

I would put your formula, 100 + ((150 - 100) / 10) * 9 and so on, into a scalar UDF and use it in a persisted computed column.

This will work on SQL Server 2012+. Test table:
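
A minimal sketch of what such a scalar UDF might look like. The function name dbo.InterpolatePrice and its parameter names are hypothetical, and the decimal(10,5) type is borrowed from the question's DDL:

```sql
-- Hypothetical helper wrapping the interpolation formula from the question.
-- SCHEMABINDING is included because a persisted computed column needs a
-- deterministic function.
CREATE FUNCTION dbo.InterpolatePrice
(
    @prevPrice   decimal(10,5),  -- preceding_price
    @nextPrice   decimal(10,5),  -- next_price
    @gapDistance int,            -- days between the two known prices
    @gapProgress int             -- days elapsed since the preceding price
)
RETURNS decimal(10,5)
WITH SCHEMABINDING
AS
BEGIN
    RETURN CASE WHEN @gapDistance = 0 THEN @prevPrice  -- no gap: nothing to interpolate
                ELSE @prevPrice
                     + (@nextPrice - @prevPrice) / @gapDistance * @gapProgress
           END;
END;
GO  -- batch separator for SSMS/sqlcmd, not a T-SQL statement

SELECT dbo.InterpolatePrice(100, 150, 10, 9) AS price;  -- 100 + ((150 - 100) / 10) * 9 = 145
```

Note that a computed column can only see columns of its own row, so the table would already need to hold the neighbouring prices and the two day counts for this approach to apply.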

DECLARE @t table
(Item char(5), Date date, Price decimal(9,1))
INSERT @t values
('Bread','2000-01-01', 10),
('Bread','2000-01-02',  9.5),
('Bread','2000-01-04',  9.1),
('Sugar','2000-01-01',  100),
('Sugar','2000-01-11',  150)

Query:

;WITH CTE as
(
SELECT
Item, Date, Price,
lead(price) over(partition by Item order by Date) nextprice,
lead(Date) over(partition by Item order by Date) nextDate
FROM @t
), N(N) as
(
SELECT 1 FROM(VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1))M(N)
), tally(N) as
(
SELECT ROW_NUMBER()OVER(ORDER BY N.N)FROM N,N a,N b,N c,N d,N e,N f
)
SELECT 
dateadd(d, coalesce(r, 0), Date) Date,
Item, 
CAST(price + coalesce((nextprice-price) * r 
/ datediff(d, date, nextdate), 0) as decimal(10,1)) Price
FROM CTE
OUTER APPLY
(
SELECT top(coalesce(datediff(d, date, nextdate), 0)) 
row_number() over (order by (select 1))-1 r
FROM tally -- use the tally CTE; N alone has only 10 rows, which would cap gaps at 10 days
) z
ORDER BY item, date

Result:

Date    Item    Price
2000-01-01  Bread   10.0
2000-01-02  Bread   9.5
2000-01-03  Bread   9.3
2000-01-04  Bread   9.1
2000-01-01  Sugar   100.0
2000-01-02  Sugar   105.0
2000-01-03  Sugar   110.0
2000-01-04  Sugar   115.0
2000-01-05  Sugar   120.0
2000-01-06  Sugar   125.0
2000-01-07  Sugar   130.0
2000-01-08  Sugar   135.0
2000-01-09  Sugar   140.0
2000-01-10  Sugar   145.0
2000-01-11  Sugar   150.0
