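I'm working with a table that holds data for business days only. The data is essentially a set of daily closing balances. It looks like this: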
ID  Name      Some Val  Other Val  Date
10  Somebody  33001.93  33001.93   2018-10-01
10  Somebody  33481.93  33481.93   2018-10-02
10  Somebody  33001.93  33001.93   2018-10-03
10  Somebody  33582.76  33582.76   2018-10-04
10  Somebody  33582.73  33582.79   2018-10-05
------- Missing row for 2018-10-06 -------
------- Missing row for 2018-10-07 -------
10  Somebody  33582.76  33582.76   2018-10-08
------- Missing row for 2018-10-09 -------
10  Somebody  33462.76  33462.76   2018-10-10
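My task is to calculate the average daily balance (the sum of the end-of-day balances / the total number of days). To do that calculation I need data for every day, so each missing day has to be filled in with the last available row.
What I need is: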
ID  Name      Some Val  Other Val  Date
10  Somebody  33001.93  33001.93   2018-10-01
10  Somebody  33481.93  33481.93   2018-10-02
10  Somebody  33001.93  33001.93   2018-10-03
10  Somebody  33582.76  33582.76   2018-10-04
10  Somebody  33582.73  33582.79   2018-10-05
10  Somebody  33582.73  33582.79   2018-10-06
10  Somebody  33582.73  33582.79   2018-10-07
10  Somebody  33582.76  33582.76   2018-10-08
10  Somebody  33582.76  33582.76   2018-10-09
10  Somebody  33462.76  33462.76   2018-10-10
Basically, row 5 is copied into the missing rows 6 and 7, and row 8 is copied into row 9.
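To make the fill rule and the average concrete, here is a minimal Python sketch of the calculation described above, using the single-ID sample values from the "Some Val" column. The function names `forward_fill` and `average_daily_balance` are hypothetical, chosen only for this illustration, and `Decimal` is used on the assumption that the balances are exact monetary amounts.

```python
from datetime import date, timedelta
from decimal import Decimal


def forward_fill(balances, start, end):
    """Return {day: balance} for every day in [start, end], carrying the
    last known balance forward over days that have no row."""
    filled = {}
    last = None
    day = start
    while day <= end:
        if day in balances:
            last = balances[day]
        if last is not None:  # days before the first known row stay absent
            filled[day] = last
        day += timedelta(days=1)
    return filled


def average_daily_balance(filled):
    """Sum of end-of-day balances divided by the number of days."""
    return sum(filled.values()) / len(filled)


# Sample data for ID 10; 2018-10-06, -07 and -09 are missing.
known = {
    date(2018, 10, 1): Decimal("33001.93"),
    date(2018, 10, 2): Decimal("33481.93"),
    date(2018, 10, 3): Decimal("33001.93"),
    date(2018, 10, 4): Decimal("33582.76"),
    date(2018, 10, 5): Decimal("33582.73"),
    date(2018, 10, 8): Decimal("33582.76"),
    date(2018, 10, 10): Decimal("33462.76"),
}

filled = forward_fill(known, date(2018, 10, 1), date(2018, 10, 10))
average = average_daily_balance(filled)
```

Without the fill step the divisor would be 7 instead of 10, which is why the gaps have to be closed before averaging.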
I got a solution by creating a calendar table and then using the following query:
SELECT
    CASE WHEN ID IS NULL
         THEN (SELECT ID
               FROM T tt
               WHERE tt.Date < t1.minDt
               ORDER BY tt.Date DESC
               LIMIT 1)
         ELSE ID
    END AS ID,
    CASE WHEN Name IS NULL
         THEN (SELECT Name
               FROM T tt
               WHERE tt.Date < t1.minDt
               ORDER BY tt.Date DESC
               LIMIT 1)
         ELSE Name
    END AS Name,
    CASE WHEN SomeVal IS NULL
         THEN (SELECT SomeVal
               FROM T tt
               WHERE tt.Date < t1.minDt
               ORDER BY tt.Date DESC
               LIMIT 1)
         ELSE SomeVal
    END AS SomeVal,
    CASE WHEN OtherVal IS NULL
         THEN (SELECT OtherVal
               FROM T tt
               WHERE tt.Date < t1.minDt
               ORDER BY tt.Date DESC
               LIMIT 1)
         ELSE OtherVal
    END AS OtherVal,
    minDt
FROM calendar t1
LEFT JOIN T t2 ON t1.minDt = t2.Date
ORDER BY t1.minDt;
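This solution works as long as the ID value is constant. I have since realized that my data set has thousands of records, with a few hundred unique ID values, and any ID can have missing dates. The query above only fills the gaps for the top part of the data, not for the whole data set. I need to run the same logic for each ID. I think something like PARTITION BY would do this in MySQL, but I'm not sure how to go about it.
The data actually looks like this: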
10,'Somebody',33001.93,33001.93,'2018-10-01'
10,'Somebody',33481.93,33481.93,'2018-10-02'
10,'Somebody',33001.93,33001.93,'2018-10-03'
10,'Somebody',33582.76,33582.76,'2018-10-04'
10,'Somebody',33582.73,33582.79,'2018-10-05'
10,'Somebody',33582.76,33582.76,'2018-10-08'
15,'someone else',33462.76,33462.76,'2018-10-01'
15,'someone else',33582.76,33582.76,'2018-10-04'
15,'someone else',33582.73,33582.79,'2018-10-05'
15,'someone else',33582.76,33582.76,'2018-10-08'
15,'someone else',33462.76,33462.76,'2018-10-10'
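One way to make the lookup per-ID instead of global is to pair every calendar date with every distinct ID and add `tt.ID = i.ID` to the correlated subquery. The sketch below runs that query through Python's bundled `sqlite3` module purely so it can be executed standalone; the SQL itself uses only constructs that MySQL 5.7 also accepts, but it has not been tested there. The column types and the use of `<=` (so existing rows and gaps go through the same subquery, with no CASE needed) are assumptions of this sketch, not part of the original query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE T (ID INTEGER, Name TEXT, SomeVal REAL, OtherVal REAL, Date TEXT);
CREATE TABLE calendar (minDt TEXT);
""")

rows = [
    (10, "Somebody", 33001.93, 33001.93, "2018-10-01"),
    (10, "Somebody", 33481.93, 33481.93, "2018-10-02"),
    (10, "Somebody", 33001.93, 33001.93, "2018-10-03"),
    (10, "Somebody", 33582.76, 33582.76, "2018-10-04"),
    (10, "Somebody", 33582.73, 33582.79, "2018-10-05"),
    (10, "Somebody", 33582.76, 33582.76, "2018-10-08"),
    (15, "someone else", 33462.76, 33462.76, "2018-10-01"),
    (15, "someone else", 33582.76, 33582.76, "2018-10-04"),
    (15, "someone else", 33582.73, 33582.79, "2018-10-05"),
    (15, "someone else", 33582.76, 33582.76, "2018-10-08"),
    (15, "someone else", 33462.76, 33462.76, "2018-10-10"),
]
conn.executemany("INSERT INTO T VALUES (?, ?, ?, ?, ?)", rows)
conn.executemany(
    "INSERT INTO calendar VALUES (?)",
    [(f"2018-10-{d:02d}",) for d in range(1, 11)],
)

# Every (ID, date) pair; the subquery picks the latest row for that ID
# on or before the calendar date.
QUERY = """
SELECT i.ID,
       c.minDt,
       (SELECT tt.SomeVal
          FROM T tt
         WHERE tt.ID = i.ID AND tt.Date <= c.minDt
         ORDER BY tt.Date DESC
         LIMIT 1) AS SomeVal
FROM calendar c
CROSS JOIN (SELECT DISTINCT ID FROM T) i
ORDER BY i.ID, c.minDt
"""

filled = {(id_, day): val for id_, day, val in conn.execute(QUERY)}
```

The other columns (Name, OtherVal) would need the same subquery repeated, so this stays as verbose as the original; it only removes the dependence on a single ID.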
Here is where you can try the query above against dummy data:
View on DB Fiddle
The MySQL version I'm using is:
mysql Ver 14.14 Distrib 5.7.24, for Linux (x86_64) using EditLine wrapper
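You can use MySQL variables to fill in the table data. The trick is to JOIN the calendar table to a list of the distinct ID values in the table, which gives you a table with one row for every combination of ID and date. That can then be LEFT JOINed to the data table to pick up the values where they exist, and MySQL variables can be used to fill in the blanks: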
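-- Caveats: MySQL does not guarantee the evaluation order of expressions that
-- assign user variables, and the variables are not reset when id changes, so
-- an id with no row on the first calendar date inherits the previous id's values.
SELECT thedate,
       @name     := COALESCE(Name, @name)         AS Name,
       @someval  := COALESCE(SomeVal, @someval)   AS SomeVal,
       @otherval := COALESCE(OtherVal, @otherval) AS OtherVal,
       @id       := id                            AS id
FROM (
    -- one row per (date, id) pair; data columns are NULL where no row exists
    SELECT c.thedate, i.id, t.Name, t.SomeVal, t.OtherVal
    FROM calendar c
    JOIN (SELECT DISTINCT id FROM t) i   -- no ON clause: every date x every id
    LEFT JOIN t ON t.date = c.thedate AND t.id = i.id
) g
CROSS JOIN (SELECT @id := 0, @name := '', @someval := 0, @otherval := 0) v  -- initialise the variables
ORDER BY id, thedate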
Output for the sample data:
thedate     Name          SomeVal   OtherVal  id
2018-10-01  Somebody      33001.93  33001.93  10
2018-10-02  Somebody      33481.93  33481.93  10
2018-10-03  Somebody      33001.93  33001.93  10
2018-10-04  Somebody      33582.76  33582.76  10
2018-10-05  Somebody      33582.73  33582.79  10
2018-10-06  Somebody      33582.73  33582.79  10
2018-10-07  Somebody      33582.73  33582.79  10
2018-10-08  Somebody      33582.76  33582.76  10
2018-10-09  Somebody      33582.76  33582.76  10
2018-10-10  Somebody      33582.76  33582.76  10
2018-10-01  someone else  33462.76  33462.76  15
2018-10-02  someone else  33462.76  33462.76  15
2018-10-03  someone else  33462.76  33462.76  15
2018-10-04  someone else  33582.76  33582.76  15
2018-10-05  someone else  33582.73  33582.79  15
2018-10-06  someone else  33582.73  33582.79  15
2018-10-07  someone else  33582.73  33582.79  15
2018-10-08  someone else  33582.76  33582.76  15
2018-10-09  someone else  33582.76  33582.76  15
2018-10-10  someone else  33462.76  33462.76  15
I've created a demo on dbfiddle that shows how all the parts fit together (including my calendar table, which only contains the dates from your table).
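For readers who want to check the logic outside the database, the same three steps (pair every date with every ID, look up the data row if there is one, carry the last seen values forward) can be mirrored in plain Python. This is only an illustrative sketch: `fill_per_id` is a hypothetical name, and unlike the variable-based query it explicitly resets the carried values whenever the ID changes, which is an addition of this sketch rather than something the SQL does.

```python
from itertools import product


def fill_per_id(rows, dates):
    """rows: (id, name, some_val, other_val, date) tuples.
    Returns (date, name, some_val, other_val, id) for every id/date pair,
    with gaps filled from that id's most recent earlier row."""
    by_key = {(r[0], r[4]): r for r in rows}
    ids = sorted({r[0] for r in rows})
    out = []
    carry = None
    current_id = None
    # product(ids, dates) plays the role of joining the calendar to the distinct ids
    for id_, day in product(ids, sorted(dates)):
        if id_ != current_id:
            current_id, carry = id_, None  # do not leak values across ids
        row = by_key.get((id_, day))       # the LEFT JOIN: None where missing
        if row is not None:
            carry = row[1:4]
        if carry is not None:              # skip days before the id's first row
            out.append((day, *carry, id_))
    return out


rows = [
    (10, "Somebody", 33001.93, 33001.93, "2018-10-01"),
    (10, "Somebody", 33481.93, 33481.93, "2018-10-02"),
    (10, "Somebody", 33001.93, 33001.93, "2018-10-03"),
    (10, "Somebody", 33582.76, 33582.76, "2018-10-04"),
    (10, "Somebody", 33582.73, 33582.79, "2018-10-05"),
    (10, "Somebody", 33582.76, 33582.76, "2018-10-08"),
    (15, "someone else", 33462.76, 33462.76, "2018-10-01"),
    (15, "someone else", 33582.76, 33582.76, "2018-10-04"),
    (15, "someone else", 33582.73, 33582.79, "2018-10-05"),
    (15, "someone else", 33582.76, 33582.76, "2018-10-08"),
    (15, "someone else", 33462.76, 33462.76, "2018-10-10"),
]
dates = [f"2018-10-{d:02d}" for d in range(1, 11)]  # ISO strings sort chronologically

result = fill_per_id(rows, dates)
```

With the sample data every ID has a row on the first date, so the reset never changes the outcome here; it only matters when an ID starts later than the calendar does.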
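I think I've made some progress using the same logic described above. I had to build the calendar lookup table together with the ID data, and I'm matching on both date and ID. The resulting table ends up with a lot of duplicate/empty records, but once those are cleaned out of the data it basically meets my needs.
This is certainly not the most elegant solution, since the temporary data set I'm working with is quite large. There must be a cleaner solution, but for now this works for me.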