用标准删除每个组的多余行



我有一个Postgres表,有这些列:

id          int8
, user_id     varchar
, is_favorite boolean
, join_time   timestamptz

我想在某些条件下删除这个表中的一些行:

  • 每个user_id最多保留10行。
  • 这10行必须包含每个user_idis_favorite=true行(每个user_id不能超过5行is_favorite=true)
  • 其余的10行必须是最新的join_time

我想删除这个表中每个user_id超过10的行,并且有多个user_id条件。

有一个题目没有user_id标准,但是和性能问题有关,我开了一个新题目。对于整个表的删除,参见:

  • https://stackoverflow.com/questions/75882813/delete-excess-rows-per-group

脚本将在大约5分钟内执行8次。表包含约200K行,与users表相关。(版本:PostgreSQL 12.10)

例子:

id|user_id                             |is_favorite|join_time                    
------------------------------------+------------------------------------+-------
1 |655caab8-ce81-11ed-afa1-0242ac120002|true       |2023-03-04 15:16:40.000 +0300
2 |655caab8-ce81-11ed-afa1-0242ac120002|true       |2023-03-03 15:16:25.000 +0300
3 |655caab8-ce81-11ed-afa1-0242ac120002|true       |2023-03-02 15:16:40.000 +0300
4 |655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-04-22 15:16:40.000 +0300
5 |655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-23 15:16:25.000 +0300
6 |655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-21 15:16:25.000 +0300
7 |655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-20 15:16:40.000 +0300
8 |655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-19 15:16:25.000 +0300
9 |655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-18 15:16:40.000 +0300
10|655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-17 15:16:25.000 +0300
11|655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-16 15:16:40.000 +0300
12|655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-15 15:16:25.000 +0300
13|655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-14 15:16:40.000 +0300
14|655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-14 15:16:39.000 +0300
15|81c126b6-ce81-11ed-afa1-0242ac120002|true       |2023-03-01 12:16:25.000 +0300
16|81c126b6-ce81-11ed-afa1-0242ac120002|true       |2023-03-01 11:16:25.000 +0300
17|81c126b6-ce81-11ed-afa1-0242ac120002|true       |2023-03-01 10:16:25.000 +0300
18|81c126b6-ce81-11ed-afa1-0242ac120002|true       |2023-03-01 09:16:25.000 +0300
19|81c126b6-ce81-11ed-afa1-0242ac120002|true       |2023-03-01 08:16:25.000 +0300
20|81c126b6-ce81-11ed-afa1-0242ac120002|false      |2023-03-01 07:16:25.000 +0300
21|81c126b6-ce81-11ed-afa1-0242ac120002|false      |2023-03-01 06:16:25.000 +0300
22|81c126b6-ce81-11ed-afa1-0242ac120002|false      |2023-03-01 05:16:25.000 +0300
23|81c126b6-ce81-11ed-afa1-0242ac120002|false      |2023-03-01 04:16:25.000 +0300
24|81c126b6-ce81-11ed-afa1-0242ac120002|false      |2023-03-01 03:16:25.000 +0300
25|81c126b6-ce81-11ed-afa1-0242ac120002|false      |2023-03-01 02:16:25.000 +0300
26|81c126b6-ce81-11ed-afa1-0242ac120002|false      |2023-03-01 01:16:25.000 +0300
27|91a005bc-cf89-11ed-afa1-0242ac120002|true       |2023-01-08 05:27:25.000 +0300
28|91a005bc-cf89-11ed-afa1-0242ac120002|true       |2023-02-09 07:32:25.000 +0300
29|91a005bc-cf89-11ed-afa1-0242ac120002|false      |2023-02-10 08:51:25.000 +0300

对于user_id in '655caab8-ce81-11ed-afa1-0242ac120002, 81c126b6-ce81-11ed-afa1-0242ac120002',这些id必须被删除:11,12,13,14,25,26

解决这个问题最有效的方法是什么?

应用于整个表(所有用户),我对你上一个问题建议的查询速度更快。
仅适用于少数给定用户,我建议:

DELETE FROM tbl t
USING (
SELECT id, row_number() OVER (PARTITION BY user_id
ORDER BY is_favorite DESC, join_time DESC
ROWS UNBOUNDED PRECEDING) AS rn
FROM   tbl t
WHERE  user_id = ANY ({'655caab8-ce81-11ed-afa1-0242ac120002
,81c126b6-ce81-11ed-afa1-0242ac120002}')
) del
WHERE  t.id = del.id
AND    del.rn >10;

筛选早期。
user_id开头的索引会很有帮助。
id为前导的索引在任何情况下都应该存在。

其他注意事项在我之前的回答中有详细说明:

  • DELETE每组多余行

您可以使用CTE(公共表表达式)来实现这一点。

WITH ranked_rows AS (
SELECT
id,
user_id,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY is_favorite DESC, join_time DESC
) AS row_num
FROM your_table_name
WHERE user_id IN ('655caab8-ce81-11ed-afa1-0242ac120002', '81c126b6-ce81-11ed-afa1-0242ac120002')
)
DELETE FROM your_table_name
WHERE id IN (
SELECT id FROM ranked_rows WHERE row_num > 10
);

相关内容

  • 没有找到相关文章

最新更新