发布如何基于列保留最新副本并删除旧记录



我有一个事实表不断插入大量记录。它适用于另一个表中原始记录的 ML 预测结果。由于插入的频率非常高,因此在插入新批次的结果之前,我只想保留最后一批的预测结果。如何编写查询以删除较旧的记录,同时仅保留每个record_id的最新副本?

+--------+-----------+------------+------------+
| row_id | record_id | prediction | insert_ts  |
+--------+-----------+------------+------------+
|      1 |       101 | cat        | 2020-06-04 |
|      2 |       102 | dog        | 2020-06-04 |
|      3 |       103 | tiger      | 2020-06-04 |
|      4 |       101 | tiger      | 2020-06-05 |
|      5 |       102 | lion       | 2020-06-05 |
|      6 |       101 | dog        | 2020-06-06 |
+--------+-----------+------------+------------+

运行查询后,表中剩余的内容应如下所示。

+--------+-----------+------------+------------+
| row_id | record_id | prediction | insert_ts  |
+--------+-----------+------------+------------+
|      3 |       103 | tiger      | 2020-06-04 |
|      5 |       102 | lion       | 2020-06-05 |
|      6 |       101 | dog        | 2020-06-06 |
+--------+-----------+------------+------------+

我发现一些帖子建议截断表格并插入最新的记录。但这会弄乱serial并在其他表中用作外键引用的row_id。如何编写此delete查询?

您可以使用row_number()获取最新记录。您可以使用以下查询创建一个新表,该表将仅具有最新的日期记录。

这是演示。

select
row_id,
record_id,
prediction,
insert_ts
from
(
select
*,
row_number() over (partition by record_id order by insert_ts desc) as rnk
from myTable
) val
where rnk = 1

输出:

| row_id | record_id | prediction | insert_ts  |
| ------ | --------- | ---------- | ---------- |
| 3      | 103       | tiger      | 2020-06-04 |
| 5      | 102       | lion       | 2020-06-05 |
| 6      | 101       | dog        | 2020-06-06 |

您也可以在代码中处理此逻辑,其中现有record_id的新更新,然后删除旧更新并插入新更新或覆盖它。

with keep_these as
(
select distinct first_value(row_id)
over (partition by record_id order by insert_ts desc) as rid 
from tbl
) 
delete from tbl where row_id not in (select rid from keep_these);

当一条记录的命运取决于其他记录的存在时,您可以使用EXISTS()


DELETE FROM the_table d
WHERE EXISTS (                      -- a record exists
SELECT * FROM the_table x
WHERE x.record_id = d.record_id -- with the same record_id
AND x.insert_ts  > d.insert_ts  -- but with a newer timestamp
); 

最新更新