我有一个事实表不断插入大量记录。它适用于另一个表中原始记录的 ML 预测结果。由于插入的频率非常高,因此在插入新批次的结果之前,我只想保留最后一批的预测结果。如何编写查询以删除较旧的记录,同时仅保留每个record_id的最新副本?
+--------+-----------+------------+------------+
| row_id | record_id | prediction | insert_ts |
+--------+-----------+------------+------------+
| 1 | 101 | cat | 2020-06-04 |
| 2 | 102 | dog | 2020-06-04 |
| 3 | 103 | tiger | 2020-06-04 |
| 4 | 101 | tiger | 2020-06-05 |
| 5 | 102 | lion | 2020-06-05 |
| 6 | 101 | dog | 2020-06-06 |
+--------+-----------+------------+------------+
运行查询后,表中剩余的内容应如下所示。
+--------+-----------+------------+------------+
| row_id | record_id | prediction | insert_ts |
+--------+-----------+------------+------------+
| 3 | 103 | tiger | 2020-06-04 |
| 5 | 102 | lion | 2020-06-05 |
| 6 | 101 | dog | 2020-06-06 |
+--------+-----------+------------+------------+
我发现一些帖子建议截断表格并插入最新的记录。但这会弄乱serial
并在其他表中用作外键引用的row_id。如何编写此delete
查询?
您可以使用row_number()
获取最新记录。您可以使用以下查询创建一个新表,该表将仅具有最新的日期记录。
这是演示。
select
row_id,
record_id,
prediction,
insert_ts
from
(
select
*,
row_number() over (partition by record_id order by insert_ts desc) as rnk
from myTable
) val
where rnk = 1
输出:
| row_id | record_id | prediction | insert_ts |
| ------ | --------- | ---------- | ---------- |
| 3 | 103 | tiger | 2020-06-04 |
| 5 | 102 | lion | 2020-06-05 |
| 6 | 101 | dog | 2020-06-06 |
您也可以在代码中处理此逻辑,其中现有record_id
的新更新,然后删除旧更新并插入新更新或覆盖它。
with keep_these as
(
select distinct first_value(row_id)
over (partition by record_id order by insert_ts desc) as rid
from tbl
)
delete from tbl where row_id not in (select rid from keep_these);
当一条记录的命运取决于其他记录的存在时,您可以使用EXISTS()
:
DELETE FROM the_table d
WHERE EXISTS ( -- a record exists
SELECT * FROM the_table x
WHERE x.record_id = d.record_id -- with the same record_id
AND x.insert_ts > d.insert_ts -- but with a newer timestamp
);