Ignoring filtered words in the query string when using phrase matching in Elasticsearch

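I'm using a custom index analyzer to remove a specific set of stop words. I then run a phrase match query with text that contains some of those stop words. I expected the stop words to be filtered out of the query, but they aren't (and any document that doesn't contain them is excluded from the results).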

Here is a simplified example of what I'm trying to do:

#!/bin/bash

export ELASTICSEARCH_ENDPOINT="http://localhost:9200"

# Create index, with a custom analyzer to filter out the word 'foo'
curl -XPUT "$ELASTICSEARCH_ENDPOINT/play" -d '{
"settings": {
"analysis": {
"analyzer": {
"fooAnalyzer": {
"type": "custom",
"tokenizer": "letter",
"filter": [
"fooFilter"
]
}
},
"filter": {
"fooFilter": {
"type": "stop",
"stopwords": [
"foo"
]
}
}
}
},
"mappings": {
"myDocument": {
"properties": {
"myMessage": {
"analyzer": "fooAnalyzer",
"type": "string"
}
}
}
}
}'

# Add sample document
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_bulk?refresh=true" -d '
{"index":{"_index":"play","_type":"myDocument"}}
{"myMessage":"bar baz"}
'

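If I run a phrase match search against this index with the filtered stop word in the middle of the query, I expect it to match (because 'foo' should be filtered out by our analyzer).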

curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
{
"query": {
"match": {
"myMessage": {
"type": "phrase",
"query": "bar foo baz"
}
}
}
}
'

However, I get no results.

Is there a way to instruct Elasticsearch to tokenize and filter the query string before executing the search?

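Edit 1: Now I'm even more confused. Previously I saw that phrase matching didn't work if my query contained a stop word in the middle of the query text. Now I also see that the phrase query doesn't work if the document contains a stop word in the middle of the text being matched. Here is a minimal example, still using the mapping above.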

# Remember that 'foo' is a stop word and is filtered out during analysis
POST play/myDocument
{
"myMessage": "fib foo bar"
}
GET play/_search
{
"query": {
"match": {
"myMessage": {
"type": "phrase",
"query": "fib bar"
}
}
}
}

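This query doesn't match. I was very surprised by this! I expected the foo stop word to be filtered out and ignored.

For an example of why I expected that, see the following query: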

POST play/myDocument
{
"myMessage": "fib 123 bar"
}
GET play/_search
{
"query": {
"match": {
"myMessage": {
"type": "phrase",
"query": "fib bar"
}
}
}
}

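This matches, because '123' is dropped by my 'letter' tokenizer. It seems that phrase matching completely ignores stop word filtering and behaves as if those tokens were still in the analyzed field (even though they don't show up in the token list returned by _analyze).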
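To make the difference between the two cases concrete, here is a minimal Python sketch of what I believe is going on. It is a simplified model, not Elasticsearch code: `analyze` and `phrase_matches` are hypothetical helpers, the letter tokenizer is approximated with an ASCII-only regex, and the key assumption is that a removed stop word still consumes a position, while text the tokenizer never emits (such as '123') does not.

```python
import re

STOPWORDS = frozenset({"foo"})


def analyze(text, stopwords=STOPWORDS):
    """Simplified model of fooAnalyzer: letter tokenizer, then stop filter.

    Returns (token, position) pairs. Positions are assigned by the
    tokenizer; the stop filter drops tokens but leaves their positions
    behind as gaps (assumption of this model).
    """
    tokens = re.findall(r"[A-Za-z]+", text)  # digits never become tokens
    return [(tok, pos) for pos, tok in enumerate(tokens) if tok not in stopwords]


def phrase_matches(doc_text, query_text):
    """Exact phrase match (slop 0): relative positions must line up."""
    doc = analyze(doc_text)
    query = analyze(query_text)
    if not query:
        return False
    doc_set = set(doc)
    first_tok, first_pos = query[0]
    starts = [pos for tok, pos in doc if tok == first_tok]
    return any(
        all((tok, start + (pos - first_pos)) in doc_set for tok, pos in query)
        for start in starts
    )


if __name__ == "__main__":
    print(analyze("bar foo baz"))                     # [('bar', 0), ('baz', 2)]
    print(analyze("fib 123 bar"))                     # [('fib', 0), ('bar', 1)]
    print(phrase_matches("bar baz", "bar foo baz"))   # False
    print(phrase_matches("fib foo bar", "fib bar"))   # False
    print(phrase_matches("fib 123 bar", "fib bar"))   # True
```

Under this model, the gap left by 'foo' is what keeps the phrase from lining up, in the query and in the document alike.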

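My current best workaround:

1. Call the _analyze endpoint with my custom analyzer on the document's text string. This returns the tokens from the original text string, with the pesky stop words removed for me.
2. Save a version of my text built from only those tokens into a "filtered" field on the document.

Later, at query time:

1. Call the _analyze endpoint with my custom analyzer on my query string to get only the tokens.
2. Run a phrase match query with the filtered token string against the document's new "filtered" field.

A workaround that should work:

1. Call the _analyze endpoint with my custom analyzer on my query string. This returns the tokens from the original query string, with the pesky stop words removed for me.
2. Run a phrase match query using the filtered tokens.

However, this obviously requires two calls to Elasticsearch for every query. I'd like to find a better solution if possible.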
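The query-time half of that workaround could be sketched roughly as follows. This only builds request bodies and does not talk to a cluster. `build_analyze_request` and `phrase_query_from_tokens` are hypothetical helper names, the JSON body form of _analyze is an assumption (older versions took the analyzer as a URL parameter), and `SAMPLE_ANALYZE_RESPONSE` is a hand-written illustration of the response shape rather than captured output.

```python
import json


def build_analyze_request(text, analyzer="fooAnalyzer"):
    """Body for POST play/_analyze (assumes the JSON body form of the API)."""
    return {"analyzer": analyzer, "text": text}


def phrase_query_from_tokens(analyze_response, field="myMessage"):
    """Join the surviving tokens and wrap them in a match_phrase query.

    Only the token text is used; the original positions (and their gaps)
    are deliberately discarded.
    """
    tokens = [entry["token"] for entry in analyze_response["tokens"]]
    return {"query": {"match_phrase": {field: " ".join(tokens)}}}


# Hand-written illustration of an _analyze response for "bar foo baz".
SAMPLE_ANALYZE_RESPONSE = {
    "tokens": [
        {"token": "bar", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0},
        {"token": "baz", "start_offset": 8, "end_offset": 11, "type": "word", "position": 2},
    ]
}

if __name__ == "__main__":
    print(json.dumps(build_analyze_request("bar foo baz")))
    print(json.dumps(phrase_query_from_tokens(SAMPLE_ANALYZE_RESPONSE)))
```

The first body goes to _analyze, the second to _search, which is exactly the two round trips I would like to avoid.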

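It turns out that if you want to use phrase matching, the token filter stage is too late to remove unwanted words. By that point, the position field of the important tokens has already been polluted by the presence of the filtered tokens, and phrase matching refuses to work.

The answer: filter before we get to the token filter level. I created a char_filter that removes the terms we don't want, and phrase matching started working correctly!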

PUT play 
{
"settings": {
"analysis": {
"analyzer": {
"fooAnalyzer": {
"type": "custom",
"tokenizer": "letter",
"char_filter": [
"fooFilter"
]
}
},
"char_filter": {
"fooFilter": {
"type": "pattern_replace",
"pattern": "(foo)",
"replacement": ""
}
}
}
},
"mappings": {
"myDocument": {
"properties": {
"myMessage": {
"analyzer": "fooAnalyzer",
"type": "string"
}
}
}
}
}

Queries:

POST play/myDocument
{
"myMessage": "fib bar"
}

GET play/_search
{
"query": {
"match": {
"myMessage": {
"type": "phrase",
"query": "fib foo bar"
}
}
}
}

and

POST play/myDocument
{
"myMessage": "fib foo bar"
}

GET play/_search
{
"query": {
"match": {
"myMessage": {
"type": "phrase",
"query": "fib bar"
}
}
}
}

Both now work!
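Here is a small Python model of why the char_filter version behaves differently. It is only a sketch: `analyze_with_char_filter` is a hypothetical helper, the letter tokenizer is again approximated with an ASCII regex, and Python's `re` stands in for the Java regex engine that pattern_replace uses. The point it tries to show is that text removed before tokenization never receives a position, so no gap is left behind.

```python
import re


def analyze_with_char_filter(text, pattern=r"(foo)", replacement=""):
    """Model of the analyzer above: pattern_replace char filter, then letter tokenizer.

    The char filter rewrites the raw text first, so the tokenizer numbers
    only the words that survive.
    """
    filtered = re.sub(pattern, replacement, text)
    tokens = re.findall(r"[A-Za-z]+", filtered)
    return [(tok, pos) for pos, tok in enumerate(tokens)]


if __name__ == "__main__":
    print(analyze_with_char_filter("fib foo bar"))  # [('fib', 0), ('bar', 1)]
    print(analyze_with_char_filter("fib bar"))      # [('fib', 0), ('bar', 1)]
    # Caveat: the bare pattern also eats 'foo' inside longer words.
    print(analyze_with_char_filter("food bar"))     # [('d', 0), ('bar', 1)]
    # A word-boundary pattern avoids that in this model.
    print(analyze_with_char_filter("food bar", pattern=r"\bfoo\b"))  # [('food', 0), ('bar', 1)]
```

If partial-word removal matters for your data, a word-boundary pattern such as `\bfoo\b` (written as "\\bfoo\\b" inside the JSON settings) may be a safer choice than the bare `(foo)`; I have only checked that in this model, not against a cluster.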

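Solution

This is an alternative solution to a similar problem, but it removes English stop words and handles multi-valued fields; tested on v7.10. It doesn't require an explicit char_filter. It uses the standard analyzer with English stop words and makes the field a text field, so it should handle match_phrase correctly: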

PUT play
{
"settings": {
"analysis": {
"analyzer": {
"phrase_analyzer": {
"type": "standard",
"stopwords": "_english_" //for my use case
}
}
}
},
"mappings": {
// "myDocument" is not used in v7.x
"properties": {
"myMessage": {
"analyzer": "phrase_analyzer",
"type": "text" //changed to handle match_phrase
}
}
}
}

For this demo data:

POST _bulk
{ "index": { "_index": "play", "_id": "1" } }
{ "myMessage": ["Guardian of the Galaxy"]}
{ "index": { "_index": "play", "_id": "2" } }
{ "myMessage": ["Ambassador of Peace", "Guardian of the Galaxy"]}
{ "index": { "_index": "play", "_id": "3" } }
{ "myMessage": ["Guardian of the Galaxy and Ambassador of Peace"]}
{ "index": { "_index": "play", "_id": "4" } }
{ "myMessage": ["Ambassador of Peace and Guardian of the Galaxy"]}
{ "index": { "_index": "play", "_id": "5" } }
{ "myMessage": ["Supreme Galaxy and All Living Beings Guardian"]}
{ "index": { "_index": "play", "_id": "6" } }
{ "myMessage": ["Guardian of the Sun", "Worker of the Galaxy"]}

Query 1:

GET play/_search
{
"query": {
"match_phrase": {
"myMessage": {
"query": "guardian of the galaxy",
"slop": 99 //useful on multi-values text fields
//https://www.elastic.co/guide/en/elasticsearch/reference/7.10/position-increment-gap.html
}
}
}
}

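This should return documents 1 to 5, because each of them has at least one value that matches both "guardian" and "galaxy". Document 6 will not match, because each of these words matches in a different value rather than in the same one. That is why we use slop=99: it stays below the default position_increment_gap of 100 that separates the values of a multi-valued field.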
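To illustrate the role of the gap, here is a rough Python model of how positions might be laid out across the values of a multi-valued text field. It is a simplification under stated assumptions: `positions` is a hypothetical helper, the stop word set is only the subset needed for the demo data, the tokenizer is a plain `\w+` regex, and the gap of 100 is the documented default for position_increment_gap. It computes positions only and does not model sloppy phrase scoring.

```python
import re

DEMO_STOPWORDS = frozenset({"of", "the", "and"})  # subset, for the demo data only


def positions(values, gap=100, stopwords=DEMO_STOPWORDS):
    """Return (token, position) pairs for a multi-valued field.

    Each new value starts `gap` positions after the previous value ended.
    Stop words are dropped but still consume a position (assumption).
    """
    result = []
    pos = -1
    for index, value in enumerate(values):
        if index > 0:
            pos += gap
        for token in re.findall(r"\w+", value.lower()):
            pos += 1
            if token not in stopwords:
                result.append((token, pos))
    return result


if __name__ == "__main__":
    # Document 1: both words in the same value, three positions apart.
    print(positions(["Guardian of the Galaxy"]))
    # [('guardian', 0), ('galaxy', 3)]

    # Document 6: the words sit in different values, more than 100 apart.
    print(positions(["Guardian of the Sun", "Worker of the Galaxy"]))
    # [('guardian', 0), ('sun', 3), ('worker', 104), ('galaxy', 107)]
```

In this model, terms from different values are always at least 100 positions apart, which is the reasoning behind choosing a slop just under the gap.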

Query 2:


GET play/_search
{
"query": {
"match_phrase": {
"myMessage": {
"query": "\"guardian of the galaxy\"",
"slop": 99
}
}
}
}

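This should return only documents 1 to 4, because the (escaped) double quotes force an exact substring match on each value, whereas document 5 has the two words in different positions.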

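Explanation

The problem is that you used a stop token filter [1]...

"Token filters are not allowed to change the position or character offsets of each token."

...together with a match_phrase query, but [2]...

"The match_phrase query analyzes the text and creates a phrase query out of the analyzed text."

So position has already been computed before the stop token filter is applied, and match_phrase relies on it to compute matches. '123' works fine because the letter tokenizer is what defines position [1], so match_phrase is happy!

"The tokenizer is also responsible for recording the order or position of each term."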

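Exceptions: 0.3% false positives

After testing this solution against a larger variety of data, I found some particular false positives, roughly 0.3% of 4k search results. In my particular case, I use match_phrase inside a filter. To reproduce a false positive, we can swap the order of the values from item 6, so that the words "Galaxy" and "Guardian" appear close to each other: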

POST _bulk
{ "index": { "_index": "play", "_id": "7" } }
{ "myMessage": ["Worker of the Galaxy", "Guardian of the Sun"]}

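The previous Query 1 would return it as well, when it clearly shouldn't. I couldn't solve this using the Elasticsearch API; instead, I did it by programmatically removing the stop words from Query 1 (see below).

Query 3: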

GET play/_search
{
"query": {
"match_phrase": {
"myMessage": {
"query": "guardian galaxy", //manually removed "of" and "the" stop words
"slop": 99 //useful on multi-values text fields
//https://www.elastic.co/guide/en/elasticsearch/reference/7.10/position-increment-gap.html
}
}
}
}
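The "programmatic" removal could look roughly like this in Python. This is a sketch, not the code I actually ran: `strip_stop_words` and `build_match_phrase` are hypothetical helpers, and `ENGLISH_STOP_WORDS` is my understanding of the word list behind `_english_`, so check it against your Elasticsearch version before relying on it.

```python
import json

# Assumed to mirror the default English stop word list used by `_english_`.
ENGLISH_STOP_WORDS = frozenset(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)


def strip_stop_words(query, stop_words=ENGLISH_STOP_WORDS):
    """Drop stop words from a whitespace-separated query string."""
    return " ".join(word for word in query.split() if word.lower() not in stop_words)


def build_match_phrase(field, query, slop=99):
    """Build the Query 3 body from an unfiltered user query."""
    return {
        "query": {
            "match_phrase": {
                field: {"query": strip_stop_words(query), "slop": slop}
            }
        }
    }


if __name__ == "__main__":
    print(strip_stop_words("guardian of the galaxy"))  # guardian galaxy
    print(json.dumps(build_match_phrase("myMessage", "guardian of the galaxy")))
```

Note that a query made up entirely of stop words ends up empty here, so the caller probably wants to handle that case before sending the search.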
