我正在使用PostgreSQL 10 + pg_trgm扩展。
表布局:
Column | Type | Collation | Nullable | Default | Storage |
--------------+-------------------+-----------+----------+-----------------------+----------+
id | integer | | not null | | plain |
reps | integer | | | 1 | plain |
user | integer | | | | plain |
ip | character varying | | not null | ''::character varying | extended |
visittime | integer | | | | plain |
domain | character varying | | | | extended |
address | text | | | | extended |
method | character varying | | | | extended |
mime | character varying | | | | extended |
duration | integer | | | | plain |
size | bigint | | | | plain |
req_status | integer | | | | plain |
http_status | integer | | | | plain |
xproxymeta | integer | | | | plain |
Indexes:
"http_requests_pkey" PRIMARY KEY, btree (id)
"http_trgm_idx" gin (address gin_trgm_ops)
"md_userid_idx" btree ("user" DESC)
"md_visittime_idx" btree (visittime DESC)
Foreign-key constraints:
"http_requests_user_fkey" FOREIGN KEY ("user") REFERENCES username_id(id)
Triggers:
add_occupied_space_record_num AFTER INSERT ON http_requests FOR EACH ROW EXECUTE PROCEDURE add_occupied_space_record_num_func()
count_repeated_records BEFORE INSERT ON http_requests FOR EACH ROW EXECUTE PROCEDURE count_repeated_records_func()
delete_occupied_space_record_num AFTER DELETE ON http_requests FOR EACH ROW EXECUTE PROCEDURE delete_occupied_space_record_num_func()
请注意,在address
列上有一个GIN Trigram全文索引。这个表目前有~ 1000万条记录。
现在这两个相同的查询会产生非常不同的计划。第一个需要56ms,第二个需要~25秒。
# explain analyze select * from http_requests where address ilike '%abc%' order by visittime desc limit 10;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..60.93 rows=10 width=209) (actual time=2.862..21.725 rows=10 loops=1)
-> Index Scan using md_visittime_idx on http_requests (cost=0.43..500074.21 rows=82654 width=209) (actual time=2.861..21.719 rows=10 loops=1)
Filter: (address ~~* '%abc%'::text)
Rows Removed by Filter: 6663
Planning time: 0.279 ms
Execution time: 21.751 ms
(6 rows)
现在是相同的查询,只是搜索模式不同:xyz
# explain analyze select * from http_requests where address ilike '%xyz%' order by visittime desc limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=5246.47..5246.50 rows=10 width=209) (actual time=23367.849..23367.870 rows=10 loops=1)
-> Sort (cost=5246.47..5248.52 rows=818 width=209) (actual time=23367.846..23367.848 rows=10 loops=1)
Sort Key: visittime DESC
Sort Method: top-N heapsort Memory: 34kB
-> Bitmap Heap Scan on http_requests (cost=2090.34..5228.79 rows=818 width=209) (actual time=17.202..23352.607 rows=18926 loops=1)
Recheck Cond: (address ~~* '%xyz%'::text)
Heap Blocks: exact=18243
-> Bitmap Index Scan on http_trgm_idx (cost=0.00..2090.14 rows=818 width=0) (actual time=12.342..12.342 rows=18926 loops=1)
Index Cond: (address ~~* '%xyz%'::text)
Planning time: 0.190 ms
Execution time: 23368.164 ms
(11 rows)
为什么计划如此不同,我如何修复缓慢的查询?
PostgreSQL知道有很多行包含abc
,但是很少包含xyz
。
因此,在第一种情况下,它正确地决定使用支持ORDER BY
子句的索引,并丢弃不符合ILIKE
条件的行,直到找到10个结果行。
这种策略在第二种情况下就不那么有效了,因为PostgreSQL在找到10个结果之前必须搜索表的更大一部分,所以它决定使用三元组索引并对少数结果进行排序。
不幸的是,PostgreSQL的估计是错误的(18926行而不是估计的818行),所以位图堆扫描需要很长时间。
尝试'ANALYZE'表,也许使用更高的default_statistics_target
,以获得更好的估计。
如果所有都失败,将条件更改为类似
的内容WHERE address || '' ILIKE '%abc%'
让PostgreSQL不使用三元组索引