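Match phrase query not working as expected

Reading from the Elasticsearch documentation:

The match_phrase query first analyzes the query string to produce a list of terms. It then searches for all the terms, but keeps only documents that contain all of the search terms, in the same positions relative to each other.

I have configured my analyzer to use edge_ngram with the keyword tokenizer: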

{
  "index": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  }
}

Here is the Java class used for indexing:

@Document(indexName = "myindex", type = "program")
@Getter
@Setter
@Setting(settingPath = "/elasticsearch/settings.json")
public class Program {

    @org.springframework.data.annotation.Id
    private Long instanceId;

    @Field(analyzer = "autocomplete", searchAnalyzer = "autocomplete", type = FieldType.String)
    private String name;
}

If I have the phrase "hello world" in a document, the following query matches it:

{
  "match" : {
    "name" : {
      "query" : "ho",
      "type" : "phrase"
    }
  }
}
result : "hello world"

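This is not what I expect, because not all of the search terms are in the document.

My questions:

1- Shouldn't the edge_ngram/autocomplete analysis of the query "ho" produce two search terms? (The terms should be "h" and "ho".)

2- Why does "ho" match "hello world" when not all of the terms in the phrase query match? (The term "ho" should not match.)

Update:

In case the question is not clear: the match phrase query should analyze the string into a list of terms, here ho. We now have two terms, because this is edge_ngram with min_gram 1. The two terms are h and ho. According to Elasticsearch, the document must contain all of the search terms. However, hello world only has h and no ho, so why do I get a match here?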

If you provide a complete, runnable example with your question, it is much easier to help you. For example:

PUT test
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "autocomplete"
        }
      }
    }
  }
}

PUT test/_doc/1
{
  "name": "Hello world"
}

GET test/_search
{
  "query": {
    "match_phrase": {
      "name": "hello foo"
    }
  }
}

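Judging by your search query, you are using Elasticsearch 2.x or earlier. That version is end-of-life; you really should upgrade.
I'm not sure how much sense it makes to combine a phrase search with edge n-grams. What are you trying to achieve here?
Why does it match? Your search query is run through the same analyzer as the stored field. Since you have defined min_gram: 1, your ho is searched as h and ho. The h matches the h token generated from Hello world.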

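If I understand your question correctly, the tokenizer is the problem: with "tokenizer": "keyword", the exact phrase is indexed and searched as a whole.

Structured Text Tokenizers

I got an answer from the Elasticsearch forum:

You are using the edge_ngram token filter. Let's see how the analyzer handles the query string "ho". Assuming your index is named my_index: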

GET my_index/_analyze
{
  "text": "ho",
  "analyzer": "autocomplete"
}

The response shows that your analyzer outputs two tokens, both at position 0:

{
  "tokens": [
    {
      "token": "h",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "ho",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    }
  ]
}
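To see where the shared position comes from, the analyzer chain can be imitated in a few lines of Python. This is only a rough sketch of the behaviour described in this answer, not Elasticsearch or Lucene code; the function names (keyword_tokenize, edge_ngram_filter, autocomplete) are made up for illustration. The point is that a token filter expands a token it was handed and every prefix inherits that token's position.

```python
def keyword_tokenize(text):
    """Emit the whole input as one token at position 0, like the keyword tokenizer."""
    return [(text, 0)]


def lowercase(tokens):
    """Lowercase every term, leaving positions untouched."""
    return [(term.lower(), pos) for term, pos in tokens]


def edge_ngram_filter(tokens, min_gram=1, max_gram=20):
    """Expand each token into its leading prefixes; each prefix keeps the token's position."""
    out = []
    for term, pos in tokens:
        for n in range(min_gram, min(max_gram, len(term)) + 1):
            out.append((term[:n], pos))
    return out


def autocomplete(text):
    """Hypothetical stand-in for the 'autocomplete' analyzer from the question."""
    return edge_ngram_filter(lowercase(keyword_tokenize(text)))


print(autocomplete("ho"))  # [('h', 0), ('ho', 0)]
print({pos for _, pos in autocomplete("Hello world")})  # {0}
```

Both the query and the indexed value therefore end up as a pile of prefixes stacked on position 0.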

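What does Elasticsearch do with a query that has two tokens at the same position? It treats the query as an "OR", even if you use type "phrase". You can see this in the output of the validate API (which shows you the Lucene query that your query is rewritten into):

GET my_index/_validate/query?rewrite=true
{
  "query": {
    "match": {
      "name": {
        "query": "ho",
        "type": "phrase"
      }
    }
  }
}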

Because both your query and your document have an h at position 0, the document is a hit.
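The effect can be modelled with a small Python sketch. It is a simplification under stated assumptions, not the actual Lucene rewrite: query terms that share a position are treated as alternatives, and a document matches if, for every query position, at least one alternative is found at the corresponding document position. The function name phrase_matches and the hard-coded token lists are illustrative only.

```python
from collections import defaultdict


def phrase_matches(query_tokens, doc_tokens):
    """Simplified phrase match where terms at the same query position are ORed."""
    if not query_tokens:
        return False
    # position -> set of alternative terms in the query
    alternatives = defaultdict(set)
    for term, pos in query_tokens:
        alternatives[pos].add(term)
    # position -> set of terms present in the document
    doc_positions = defaultdict(set)
    for term, pos in doc_tokens:
        doc_positions[pos].add(term)

    query_positions = sorted(alternatives)
    base = query_positions[0]
    for start in doc_positions:
        # Every query position needs at least one alternative at the shifted doc position.
        if all(
            alternatives[p] & doc_positions.get(start + (p - base), set())
            for p in query_positions
        ):
            return True
    return False


# Tokens as produced by the edge_ngram *filter*: everything sits at position 0.
query = [("h", 0), ("ho", 0)]
hello_world = [("hello world"[:n], 0) for n in range(1, 12)]
world = [("world"[:n], 0) for n in range(1, 6)]

print(phrase_matches(query, hello_world))  # True: "h" is one of the alternatives
print(phrase_matches(query, world))        # False: neither "h" nor "ho" is present
```

With only one query position there is nothing left for the phrase part to constrain, so a single shared prefix is enough for a hit.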

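Now, how do you solve this? Instead of the edge_ngram token filter, you can use the edge_ngram tokenizer. This tokenizer increments the position of every token it outputs.

So if you create your index like this: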

PUT my_index
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "autocomplete_tokenizer",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "autocomplete"
        }
      }
    }
  }
}

You will see that this query is no longer a hit:

GET my_index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "ho",
        "type": "phrase"
      }
    }
  }
}

But this one, for example, is:

GET my_index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "he",
        "type": "phrase"
      }
    }
  }
}
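The difference between the two queries can be reproduced with another rough Python sketch. It assumes, as stated above, that the edge_ngram tokenizer gives each emitted prefix the next position, and that the tokenizer is left at its default of not splitting the input on any character class. The helper names edge_ngram_tokenize and phrase_hit are hypothetical and exist only for this illustration.

```python
def edge_ngram_tokenize(text, min_gram=1, max_gram=20):
    """Emit lowercased leading prefixes, each at its own increasing position."""
    text = text.lower()
    upper = min(max_gram, len(text))
    return [(text[:n], n - min_gram) for n in range(min_gram, upper + 1)]


def phrase_hit(query, doc):
    """Strict phrase check: every query term must sit at the same relative position in the doc."""
    query_tokens = edge_ngram_tokenize(query)
    doc_by_position = {pos: term for term, pos in edge_ngram_tokenize(doc)}
    return any(
        all(doc_by_position.get(start + pos) == term for term, pos in query_tokens)
        for start in doc_by_position
    )


print(edge_ngram_tokenize("ho"))        # [('h', 0), ('ho', 1)]
print(phrase_hit("ho", "Hello world"))  # False: position 1 of the document holds "he"
print(phrase_hit("he", "Hello world"))  # True: "h" then "he" line up with the document
```

Because each prefix now occupies its own position, the phrase query has to match the whole chain of prefixes in order, which is why "ho" stops matching while "he" still does.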
