Adding missing date records for each ID subgroup in a time-series table



I am working with a table that contains data for working days only. The data is essentially each day's closing balance. It looks like this:

ID  Name        Some Val    Other Val   Date
10  Somebody    33001.93    33001.93    2018-10-01
10  Somebody    33481.93    33481.93    2018-10-02
10  Somebody    33001.93    33001.93    2018-10-03
10  Somebody    33582.76    33582.76    2018-10-04
10  Somebody    33582.73    33582.79    2018-10-05
------- Missing row for 2018-10-06 ---------------
------- Missing row for 2018-10-07 ---------------
10  Somebody    33582.76    33582.76    2018-10-08
------- Missing row for 2018-10-09 ---------------
10  Somebody    33462.76    33462.76    2018-10-10

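My task is to calculate the average daily balance (sum of the end-of-day balances / total number of days). To do that calculation I need to make sure I have data for every day. For this, each missing day needs to be filled in with the last available row.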

What I need is:

ID  Name        Some Val    Other Val   Date
10  Somebody    33001.93    33001.93    2018-10-01
10  Somebody    33481.93    33481.93    2018-10-02
10  Somebody    33001.93    33001.93    2018-10-03
10  Somebody    33582.76    33582.76    2018-10-04
10  Somebody    33582.73    33582.79    2018-10-05    
10  Somebody    33582.73    33582.79    2018-10-06
10  Somebody    33582.73    33582.79    2018-10-07    
10  Somebody    33582.76    33582.76    2018-10-08
10  Somebody    33582.76    33582.76    2018-10-09
10  Somebody    33462.76    33462.76    2018-10-10

Basically, row 5 is copied into the missing rows 6 and 7, and row 8 is copied into row 9.
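The fill rule described above can be sketched outside the database to check the arithmetic. This is only an illustration for a single ID, using the sample rows for ID 10 and the Some Val column; the names `rows`, `forward_fill` and `average_daily_balance` are made up for the example and are not part of my actual setup.

```python
from datetime import date, timedelta

# Sample rows for ID 10 from the question: (date, some_val).
rows = [
    (date(2018, 10, 1), 33001.93),
    (date(2018, 10, 2), 33481.93),
    (date(2018, 10, 3), 33001.93),
    (date(2018, 10, 4), 33582.76),
    (date(2018, 10, 5), 33582.73),
    (date(2018, 10, 8), 33582.76),
    (date(2018, 10, 10), 33462.76),
]


def forward_fill(rows):
    """Return one (date, value) pair per calendar day between the first
    and last record, carrying the last known value into missing days."""
    known = dict(rows)
    day, last_day = min(known), max(known)
    filled = []
    value = None
    while day <= last_day:
        value = known.get(day, value)  # keep the previous value on a missing day
        filled.append((day, value))
        day += timedelta(days=1)
    return filled


filled = forward_fill(rows)
# Average daily balance = sum of end-of-day balances / number of days.
average_daily_balance = sum(v for _, v in filled) / len(filled)
```

With the gaps filled, the average is taken over all ten days rather than over the seven days that happen to have a row.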

I got a solution by creating a calendar table and then using the following query:

SELECT
    CASE WHEN ID IS NULL THEN (SELECT ID
                               FROM T tt
                               WHERE tt.Date < t1.minDt
                               ORDER BY tt.Date DESC
                               LIMIT 1)
         ELSE ID END ID,
    CASE WHEN Name IS NULL THEN (SELECT Name
                                 FROM T tt
                                 WHERE tt.Date < t1.minDt
                                 ORDER BY tt.Date DESC
                                 LIMIT 1)
         ELSE Name END Name,
    CASE WHEN SomeVal IS NULL THEN (SELECT SomeVal
                                    FROM T tt
                                    WHERE tt.Date < t1.minDt
                                    ORDER BY tt.Date DESC
                                    LIMIT 1)
         ELSE SomeVal END SomeVal,
    CASE WHEN OtherVal IS NULL THEN (SELECT OtherVal
                                     FROM T tt
                                     WHERE tt.Date < t1.minDt
                                     ORDER BY tt.Date DESC
                                     LIMIT 1)
         ELSE OtherVal END OtherVal,
    minDt
FROM calendar t1
LEFT JOIN T t2 ON t1.minDt = t2.Date
ORDER BY t1.minDt;

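This solution works when the ID value is constant. I have since realised that my data set has thousands of records with a few hundred unique ID values, and any ID can have missing dates. The query above only fills in the top of the data, not all of it. I need to run the same query for each ID. I think PARTITION BY would work in MySQL, but I'm not sure how to go about it.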

The data actually looks like this:

10,'Somebody',33001.93,33001.93,'2018-10-01'
10,'Somebody',33481.93,33481.93,'2018-10-02'
10,'Somebody',33001.93,33001.93,'2018-10-03'
10,'Somebody',33582.76,33582.76,'2018-10-04'
10,'Somebody',33582.73,33582.79,'2018-10-05'
10,'Somebody',33582.76,33582.76,'2018-10-08'
15,'someone else',33462.76,33462.76,'2018-10-1'
15,'someone else',33582.76,33582.76,'2018-10-04'
15,'someone else',33582.73,33582.79,'2018-10-05'
15,'someone else',33582.76,33582.76,'2018-10-08'
15,'someone else',33462.76,33462.76,'2018-10-10'

Here is where you can try it out with dummy data and the query above:

View on DB Fiddle

The MySQL version I'm using is:

mysql  Ver 14.14 Distrib 5.7.24, for Linux (x86_64) using  EditLine wrapper

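You can use MySQL variables to fill in your table data. The trick is to JOIN your calendar table to a list of the distinct ID values from your table, so that you get a table with one row for every combination of date and ID. That can then be LEFT JOINed to the data table to pick up the values that exist, and MySQL variables can be used to fill in the blanks: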

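SELECT thedate,
       -- Each variable keeps the last non-NULL value seen, which fills the gaps.
       -- Note: the variables are not reset per id, so an id with no row on the
       -- first calendar date would inherit the previous id's values.
       @name := COALESCE(Name, @name) AS Name,
       @someval := COALESCE(SomeVal, @someval) AS SomeVal,
       @otherval := COALESCE(OtherVal, @otherval) AS OtherVal,
       @id := id AS id
FROM (
    -- One row per (date, id) pair; the data columns are NULL where t has no row.
    SELECT c.thedate, i.id, t.Name, t.SomeVal, t.OtherVal
    FROM calendar c
    JOIN (SELECT DISTINCT id FROM t) i
    LEFT JOIN t ON t.date = c.thedate AND t.id = i.id
) g
CROSS JOIN (SELECT @id := 0, @name := '', @someval := 0, @otherval := 0) v
ORDER BY id, thedate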

Output for the sample data:

thedate     Name            SomeVal     OtherVal    id
2018-10-01  Somebody        33001.93    33001.93    10
2018-10-02  Somebody        33481.93    33481.93    10
2018-10-03  Somebody        33001.93    33001.93    10
2018-10-04  Somebody        33582.76    33582.76    10
2018-10-05  Somebody        33582.73    33582.79    10
2018-10-06  Somebody        33582.73    33582.79    10
2018-10-07  Somebody        33582.73    33582.79    10
2018-10-08  Somebody        33582.76    33582.76    10
2018-10-09  Somebody        33582.76    33582.76    10
2018-10-10  Somebody        33582.76    33582.76    10
2018-10-01  someone else    33462.76    33462.76    15
2018-10-02  someone else    33462.76    33462.76    15
2018-10-03  someone else    33462.76    33462.76    15
2018-10-04  someone else    33582.76    33582.76    15
2018-10-05  someone else    33582.73    33582.79    15
2018-10-06  someone else    33582.73    33582.79    15
2018-10-07  someone else    33582.73    33582.79    15
2018-10-08  someone else    33582.76    33582.76    15
2018-10-09  someone else    33582.76    33582.76    15
2018-10-10  someone else    33462.76    33462.76    15

I've created a demo on dbfiddle that shows how all the parts fit together (including my calendar table, which only covers the dates in your table).
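As a cross-check on the expected output, here is a minimal sketch of the same idea (calendar joined to the distinct IDs) that avoids user variables by looking up, for each date and ID, the latest row on or before that date. It runs against an in-memory SQLite database through Python's `sqlite3` module purely so that it is self-contained; the table layout mirrors the question, the dates are normalised to ISO strings, and I have not verified that the SQL behaves identically on MySQL 5.7. It also assumes at most one row per ID and date.

```python
import sqlite3

# Sample rows from the question: (id, name, someval, otherval, date).
# Dates are ISO-formatted strings so that text comparison matches date order.
data = [
    (10, "Somebody", 33001.93, 33001.93, "2018-10-01"),
    (10, "Somebody", 33481.93, 33481.93, "2018-10-02"),
    (10, "Somebody", 33001.93, 33001.93, "2018-10-03"),
    (10, "Somebody", 33582.76, 33582.76, "2018-10-04"),
    (10, "Somebody", 33582.73, 33582.79, "2018-10-05"),
    (10, "Somebody", 33582.76, 33582.76, "2018-10-08"),
    (15, "someone else", 33462.76, 33462.76, "2018-10-01"),
    (15, "someone else", 33582.76, 33582.76, "2018-10-04"),
    (15, "someone else", 33582.73, 33582.79, "2018-10-05"),
    (15, "someone else", 33582.76, 33582.76, "2018-10-08"),
    (15, "someone else", 33462.76, 33462.76, "2018-10-10"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER, name TEXT, someval REAL, otherval REAL, date TEXT)"
)
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?, ?)", data)

# Calendar covering 2018-10-01 .. 2018-10-10, one row per day.
conn.execute("CREATE TABLE calendar (thedate TEXT)")
conn.executemany(
    "INSERT INTO calendar VALUES (?)",
    [(f"2018-10-{d:02d}",) for d in range(1, 11)],
)

# For every (date, id) pair, join to the latest row of that id on or before
# the date. Dates before an id's first record produce no row (inner join).
filled = conn.execute("""
    SELECT c.thedate, i.id, t.name, t.someval, t.otherval
    FROM calendar c
    CROSS JOIN (SELECT DISTINCT id FROM t) i
    JOIN t ON t.id = i.id
          AND t.date = (SELECT MAX(t2.date) FROM t t2
                        WHERE t2.id = i.id AND t2.date <= c.thedate)
    ORDER BY i.id, c.thedate
""").fetchall()
```

Because the lookup is keyed on the ID, values cannot leak from one ID into the next, at the cost of a correlated subquery per row.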

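I think I made some progress using the same logic as described above. I had to create the calendar lookup table together with the ID data, and I match on both the date and the ID. The resulting table ended up with a lot of duplicate/empty records, but deduplicating the data basically gave me what I need.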

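This is certainly not the most elegant solution, because the temporary data set I'm using is fairly large. There must be a cleaner solution, but for now this works for me.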
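One possibly lighter alternative, if only the average is needed and the rows can be processed outside MySQL: weight each balance by the number of days it stays in effect instead of materialising the filled rows. The sketch below is an assumption-laden illustration, not what I actually ran; `average_daily_balances` and its `period_end` parameter are hypothetical names, it uses the Some Val column only, and it assumes one row per ID per day with no rows after `period_end`.

```python
from collections import defaultdict
from datetime import date, timedelta

# Sample rows from the question: (id, day, some_val).
rows = [
    (10, date(2018, 10, 1), 33001.93),
    (10, date(2018, 10, 2), 33481.93),
    (10, date(2018, 10, 3), 33001.93),
    (10, date(2018, 10, 4), 33582.76),
    (10, date(2018, 10, 5), 33582.73),
    (10, date(2018, 10, 8), 33582.76),
    (15, date(2018, 10, 1), 33462.76),
    (15, date(2018, 10, 4), 33582.76),
    (15, date(2018, 10, 5), 33582.73),
    (15, date(2018, 10, 8), 33582.76),
    (15, date(2018, 10, 10), 33462.76),
]


def average_daily_balances(rows, period_end):
    """Average daily balance per id, from each id's first record through
    period_end (inclusive), without building the filled-in rows."""
    by_id = defaultdict(list)
    for id_, day, balance in rows:
        by_id[id_].append((day, balance))

    averages = {}
    for id_, records in by_id.items():
        records.sort()
        # Each balance applies until the day before the next record;
        # the last balance applies through period_end.
        next_days = [d for d, _ in records[1:]] + [period_end + timedelta(days=1)]
        total, total_days = 0.0, 0
        for (day, balance), next_day in zip(records, next_days):
            days_in_effect = (next_day - day).days
            total += balance * days_in_effect
            total_days += days_in_effect
        averages[id_] = total / total_days
    return averages


averages = average_daily_balances(rows, period_end=date(2018, 10, 10))
```

This gives the same average as summing the forward-filled rows, since a balance repeated over n filled days contributes n times its value either way.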
