帖子上的单词匹配正则表达式性能不佳

我有一个被阻止短语的列表，我想匹配用户输入的文本中是否存在这些短语，但性能非常糟糕。

我正在使用这个查询：

SELECT value FROM blocked_items WHERE lower(unaccent( 'my input text' )) ~* ('[[:<:]]' || value || '[[:>:]]') LIMIT 1;

经过调查，我发现世界边界[[:<:]]，[[:>:]]表现非常糟糕，因为知道blocked_items有 24k 条记录。

例如，当我尝试运行这个时：

SELECT value FROM blocked_items WHERE lower(unaccent( 'my input text ' )) ilike ('%' || value || '%') LIMIT 1;

与第一个相比，它非常快。问题是我需要保留单词边界的测试。

此检查经常在大型程序上执行，因此性能对我来说非常重要。

你们有什么建议可以加快速度吗？

EXPLIAN 分析截图

既然你知道LIKE(~~( 查询很快，而 RegEx (~( 查询很慢，最简单的解决方案是结合这两个条件(m/M相当于[[:<:]]/[[:>:]](：

SELECT value FROM blocked_items
WHERE lower(unaccent('my input text')) ~~ ('%'||value||'%')
AND lower(unaccent('my input text')) ~ ('m'||value||'M')
LIMIT 1;

这样，快速查询条件会过滤掉大部分行，然后慢速查询条件会丢弃剩余的行。

我正在使用更快的区分大小写的运算符，假设value已经规范化。如果不是这种情况，请删除(然后冗余的(lower()，并使用原始查询中区分大小写的版本。

在我的370k行测试集上，将查询从6s(暖(加速到90ms：

                 QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=0.00..1651.85 rows=1 width=10) (actual time=89.702..89.702 rows=1 loops=1)
->  Seq Scan on blocked_items  (cost=0.00..14866.61 rows=9 width=10) (actual time=89.701..89.701 rows=1 loops=1)
Filter: ((lower(unaccent('my input text'::text)) ~~ (('%'::text || value) || '%'::text)) AND (lower(unaccent('my input text'::text)) ~ (('m'::text || value) || 'M'::text)))
Rows Removed by Filter: 153281
Planning Time: 0.097 ms
Execution Time: 89.717 ms
(6 rows)

但是，我们仍在执行全表扫描，性能将根据表中的位置而有所不同。

理想情况下，我们可以通过使用索引在接近恒定的时间内回答查询。

让我们重写查询以使用文本搜索函数和运算符：

SELECT value FROM blocked_items
WHERE to_tsvector('simple', unaccent('my input text'))
@@ phraseto_tsquery('simple', value)
LIMIT 1;

首先，我们将输入拆分为搜索向量，然后检查阻止的短语是否与这些向量匹配。

测试查询大约需要440 毫秒- 从我们的组合查询退后一步：

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=0.00..104.01 rows=1 width=10) (actual time=437.761..437.761 rows=1 loops=1)
->  Seq Scan on blocked_items  (cost=0.00..192516.05 rows=1851 width=10) (actual time=437.760..437.760 rows=1 loops=1)
Filter: (to_tsvector('simple'::regconfig, unaccent('my input text'::text)) @@ phraseto_tsquery('simple'::regconfig, value))
Rows Removed by Filter: 153281
Planning Time: 0.063 ms
Execution Time: 437.772 ms
(6 rows)

由于我们不能使用tsvector @@ tsquery来索引tsquery，我们可以再次重写查询，以使用tsquery @> tsquery文本搜索运算符检查被阻止的短语是否是输入短语的子查询，然后可以使用 GiST 运算符类中的tsquery_ops对其进行索引：

CREATE INDEX blocked_items_search ON blocked_items
USING gist (phraseto_tsquery('simple', value));
ANALYZE blocked_items; -- update query planner stats
SELECT value FROM blocked_items
WHERE phraseto_tsquery('simple', unaccent('my input text'))
@> phraseto_tsquery('simple', value)
LIMIT 1;

查询现在可以使用索引扫描，使用相同的数据需要20 毫秒。

由于 GiST 是有损索引，因此查询时间可能会有所不同，具体取决于需要重新检查的次数：

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=0.54..4.23 rows=1 width=10) (actual time=19.215..19.215 rows=1 loops=1)
->  Index Scan using blocked_items_search on blocked_items  (cost=0.54..1367.01 rows=370 width=10) (actual time=19.214..19.214 rows=1 loops=1)
Index Cond: (phraseto_tsquery('simple'::regconfig, value) <@ phraseto_tsquery('simple'::regconfig, unaccent('my input text'::text)))
Rows Removed by Index Recheck: 4028
Planning Time: 0.093 ms
Execution Time: 19.236 ms
(6 rows)

使用全文搜索的一大优点是，您现在可以通过使用搜索配置 (regconfig( 来使用特定于语言的单词匹配。

上述查询都使用默认的'simple'regconfig 来匹配原始查询的行为。通过切换到'english'您还可以匹配同一单词的变体，如cat和cats(词干提取(和没有意义的常用词，如ormy将被忽略(停用词(。

相关内容

最新更新

热门标签：