我有一个Postgres表,有这些列:
id int8
, user_id varchar
, is_favorite boolean
, join_time timestamptz
我想在某些条件下删除这个表中的一些行:
- 每个
user_id
最多保留10行。 - 这10行必须包含每个
user_id
的is_favorite=true
行(每个user_id
不能超过5行is_favorite=true
) - 其余的10行必须是最新的
join_time
。
我想删除这个表中每个user_id
超过10的行,并且有多个user_id
条件。
有一个题目没有user_id
标准,但是和性能问题有关,我开了一个新题目。对于整个表的删除,参见:
- https://stackoverflow.com/questions/75882813/delete-excess-rows-per-group
脚本将在大约5分钟内执行8次。表包含约200K行,与users
表相关。(版本:PostgreSQL 12.10)
例子:
id|user_id |is_favorite|join_time
------------------------------------+------------------------------------+-------
1 |655caab8-ce81-11ed-afa1-0242ac120002|true |2023-03-04 15:16:40.000 +0300
2 |655caab8-ce81-11ed-afa1-0242ac120002|true |2023-03-03 15:16:25.000 +0300
3 |655caab8-ce81-11ed-afa1-0242ac120002|true |2023-03-02 15:16:40.000 +0300
4 |655caab8-ce81-11ed-afa1-0242ac120002|false |2023-04-22 15:16:40.000 +0300
5 |655caab8-ce81-11ed-afa1-0242ac120002|false |2023-03-23 15:16:25.000 +0300
6 |655caab8-ce81-11ed-afa1-0242ac120002|false |2023-03-21 15:16:25.000 +0300
7 |655caab8-ce81-11ed-afa1-0242ac120002|false |2023-03-20 15:16:40.000 +0300
8 |655caab8-ce81-11ed-afa1-0242ac120002|false |2023-03-19 15:16:25.000 +0300
9 |655caab8-ce81-11ed-afa1-0242ac120002|false |2023-03-18 15:16:40.000 +0300
10|655caab8-ce81-11ed-afa1-0242ac120002|false |2023-03-17 15:16:25.000 +0300
11|655caab8-ce81-11ed-afa1-0242ac120002|false |2023-03-16 15:16:40.000 +0300
12|655caab8-ce81-11ed-afa1-0242ac120002|false |2023-03-15 15:16:25.000 +0300
13|655caab8-ce81-11ed-afa1-0242ac120002|false |2023-03-14 15:16:40.000 +0300
14|655caab8-ce81-11ed-afa1-0242ac120002|false |2023-03-14 15:16:39.000 +0300
15|81c126b6-ce81-11ed-afa1-0242ac120002|true |2023-03-01 12:16:25.000 +0300
16|81c126b6-ce81-11ed-afa1-0242ac120002|true |2023-03-01 11:16:25.000 +0300
17|81c126b6-ce81-11ed-afa1-0242ac120002|true |2023-03-01 10:16:25.000 +0300
18|81c126b6-ce81-11ed-afa1-0242ac120002|true |2023-03-01 09:16:25.000 +0300
19|81c126b6-ce81-11ed-afa1-0242ac120002|true |2023-03-01 08:16:25.000 +0300
20|81c126b6-ce81-11ed-afa1-0242ac120002|false |2023-03-01 07:16:25.000 +0300
21|81c126b6-ce81-11ed-afa1-0242ac120002|false |2023-03-01 06:16:25.000 +0300
22|81c126b6-ce81-11ed-afa1-0242ac120002|false |2023-03-01 05:16:25.000 +0300
23|81c126b6-ce81-11ed-afa1-0242ac120002|false |2023-03-01 04:16:25.000 +0300
24|81c126b6-ce81-11ed-afa1-0242ac120002|false |2023-03-01 03:16:25.000 +0300
25|81c126b6-ce81-11ed-afa1-0242ac120002|false |2023-03-01 02:16:25.000 +0300
26|81c126b6-ce81-11ed-afa1-0242ac120002|false |2023-03-01 01:16:25.000 +0300
27|91a005bc-cf89-11ed-afa1-0242ac120002|true |2023-01-08 05:27:25.000 +0300
28|91a005bc-cf89-11ed-afa1-0242ac120002|true |2023-02-09 07:32:25.000 +0300
29|91a005bc-cf89-11ed-afa1-0242ac120002|false |2023-02-10 08:51:25.000 +0300
对于user_id in '655caab8-ce81-11ed-afa1-0242ac120002, 81c126b6-ce81-11ed-afa1-0242ac120002'
,这些id必须被删除:11,12,13,14,25,26
应用于整个表(所有用户),我对你上一个问题建议的查询速度更快。
仅适用于少数给定用户,我建议:
DELETE FROM tbl t
USING (
SELECT id, row_number() OVER (PARTITION BY user_id
ORDER BY is_favorite DESC, join_time DESC
ROWS UNBOUNDED PRECEDING) AS rn
FROM tbl t
WHERE user_id = ANY ({'655caab8-ce81-11ed-afa1-0242ac120002
,81c126b6-ce81-11ed-afa1-0242ac120002}')
) del
WHERE t.id = del.id
AND del.rn >10;
筛选早期。
以user_id
开头的索引会很有帮助。
以id
为前导的索引在任何情况下都应该存在。
其他注意事项在我之前的回答中有详细说明:
- DELETE每组多余行
您可以使用CTE(公共表表达式)来实现这一点。
WITH ranked_rows AS (
SELECT
id,
user_id,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY is_favorite DESC, join_time DESC
) AS row_num
FROM your_table_name
WHERE user_id IN ('655caab8-ce81-11ed-afa1-0242ac120002', '81c126b6-ce81-11ed-afa1-0242ac120002')
)
DELETE FROM your_table_name
WHERE id IN (
SELECT id FROM ranked_rows WHERE row_num > 10
);