Prioritizing certain fields in ES search results



I am using elasticsearch-6.4.3. I created an index flight-location_methods:

settings index: {
  analysis: {
    "filter": {
      "autocomplete_filter": {
        "type": "edge_ngram",
        "min_gram": 1,
        "max_gram": 20
      }
    },
    "analyzer": {
      "autocomplete": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["lowercase", "autocomplete_filter"]
      }
    }
  }
}
mapping do
  indexes :airport_code, type: "text", analyzer: "autocomplete", search_analyzer: "standard"
  indexes :airport_name, type: "text", analyzer: "autocomplete", search_analyzer: "standard"
  indexes :city_name, type: "text", analyzer: "autocomplete", search_analyzer: "standard"
  indexes :country_name, type: "text", analyzer: "autocomplete", search_analyzer: "standard"
end

The snippets above come from the Ruby code that represents the settings and mapping I created for the index.
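For readers who do not use the Ruby DSL, the snippets above should translate into roughly the following index definition. This is only a sketch of the equivalent JSON, assuming the single _doc mapping type that appears in the search results below; the Ruby gem may emit minor differences.

# Approximate JSON equivalent of the Ruby DSL above (assumes the _doc mapping type)
PUT /flight-location_methods
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "airport_code": { "type": "text", "analyzer": "autocomplete", "search_analyzer": "standard" },
        "airport_name": { "type": "text", "analyzer": "autocomplete", "search_analyzer": "standard" },
        "city_name": { "type": "text", "analyzer": "autocomplete", "search_analyzer": "standard" },
        "country_name": { "type": "text", "analyzer": "autocomplete", "search_analyzer": "standard" }
      }
    }
  }
}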

When I execute this query:

GET /flight-location_methods/_search
{
  "from": 0,
  "size": 1000,
  "query": {
    "function_score": {
      "functions": [
        {
          "filter": {
            "match": {
              "city_name": "new yo"
            }
          },
          "weight": 50
        },
        {
          "filter": {
            "match": {
              "country_name": "new yo"
            }
          },
          "weight": 50
        }
      ],
      "max_boost": 200,
      "score_mode": "max",
      "boost_mode": "multiply",
      "min_score": 10
    }
  }
}

I get this result:

{
  "_index": "flight-location_methods",
  "_type": "_doc",
  "_id": "tcoj1G0Bdo5Q9AduxCKi",
  "_score": 50,
  "_source": {
    "airport_name": "Ouvea",
    "airport_code": "UVE",
    "city_name": "Ouvea",
    "country_name": "New Caledonia"
  }
},
{
  "_index": "flight-location_methods",
  "_type": "_doc",
  "_id": "zMoj1G0Bdo5Q9AduxCKi",
  "_score": 50,
  "_source": {
    "airport_name": "Palmerston North",
    "airport_code": "PMR",
    "city_name": "Palmerston North",
    "country_name": "New Zealand"
  }
},
{
  "_index": "flight-location_methods",
  "_type": "_doc",
  "_id": "1Moj1G0Bdo5Q9AduxCKi",
  "_score": 50,
  "_source": {
    "airport_name": "Westport",
    "airport_code": "WSZ",
    "city_name": "Westport",
    "country_name": "New Zealand"
  }
},
{
  "_index": "flight-location_methods",
  "_type": "_doc",
  "_id": "1coj1G0Bdo5Q9AduxCKi",
  "_score": 50,
  "_source": {
    "airport_name": "Whangarei",
    "airport_code": "WRE",
    "city_name": "Whangarei",
    "country_name": "New Zealand"
  }
},
{
  "_index": "flight-location_methods",
  "_type": "_doc",
  "_id": "Rsoj1G0Bdo5Q9AduxCOi",
  "_score": 50,
  "_source": {
    "airport_name": "Municipal",
    "airport_code": "RNH",
    "city_name": "New Richmond",
    "country_name": "United States"
  }
},
{
  "_index": "flight-location_methods",
  "_type": "_doc",
  "_id": "fsoj1G0Bdo5Q9AduxCOi",
  "_score": 50,
  "_source": {
    "airport_name": "New London",
    "airport_code": "GON",
    "city_name": "New London",
    "country_name": "United States"
  }
},
{
  "_index": "flight-location_methods",
  "_type": "_doc",
  "_id": "gMoj1G0Bdo5Q9AduxCOi",
  "_score": 50,
  "_source": {
    "airport_name": "New Ulm",
    "airport_code": "ULM",
    "city_name": "New Ulm",
    "country_name": "United States"
  }
},
{
  "_index": "flight-location_methods",
  "_type": "_doc",
  "_id": "5coj1G0Bdo5Q9AduxCSi",
  "_score": 50,
  "_source": {
    "airport_name": "Cape Newenham",
    "airport_code": "EHM",
    "city_name": "Cape Newenham",
    "country_name": "United States"
  }
},
{
  "_index": "flight-location_methods",
  "_type": "_doc",
  "_id": "Ycoj1G0Bdo5Q9AduxCWi",
  "_score": 50,
  "_source": {
    "airport_name": "East 60th Street H/P",
    "airport_code": "JRE",
    "city_name": "New York",
    "country_name": "United States"
  }
}

As you can see, New York should be at the top, but it is not.

Also, I cannot use the AND operator, because if the search text contains multiple words I want any of those words to match in any field. However, if all of the words in the search text occur in a single field, that document should be ranked higher.

Let's first discuss the Elasticsearch tokenizer and the tokenization process:

A tokenizer receives a stream of characters and breaks it up into individual tokens (usually individual words). (ES documentation)

Now let's describe how the autocomplete analyzer works:

  1. The standard tokenizer provides the tokens (to keep it simple, think of these as the individual words).
  2. The lowercase filter lowercases all of the characters.
  3. The edge_ngram filter then breaks each word into prefix tokens (see the _analyze sketch below).
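To see this concretely, the _analyze API shows what the autocomplete analyzer emits for a value like "New York". This is a minimal sketch against the index from the question:

# Inspect the tokens the existing autocomplete analyzer produces
GET /flight-location_methods/_analyze
{
  "analyzer": "autocomplete",
  "text": "New York"
}

With min_gram 1 and max_gram 20 this returns n, ne, new, y, yo, yor and york, so single-character tokens such as n and y end up in the index.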

Here is where the magic starts: I think your token range of 1 to 20 characters is too wide. There may be words with more than 10 characters, but for our case they don't matter. Also, tokens consisting of a single character are of no use to us. I would change it to:

"filter": {
  "autocomplete_filter": {
    "type": "edge_ngram",
    "min_gram": 2,
    "max_gram": 5
  }
}
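You can check the effect of the narrower range without reindexing by passing an ad-hoc filter chain to _analyze. This is a sketch that is not tied to any index:

# Ad-hoc analysis: standard tokenizer + lowercase + 2-5 character edge_ngram
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "edge_ngram", "min_gram": 2, "max_gram": 5 }
  ],
  "text": "New York"
}

The output is now just ne, new, yo, yor and york.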

Our index will then contain many word prefixes of 2 to 5 characters. Now that we know what we are searching against, we can create the mapping and write the queries:

{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 0,
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 5
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "airport_name": {
          "type": "text",
          "fields": {
            "ngram": {
              "type": "text",
              "analyzer": "autocomplete"
            }
          }
        },
        "airport_code": {
          "type": "keyword",
          "fields": {
            "ngram": {
              "type": "text",
              "analyzer": "autocomplete"
            }
          }
        },
        "city_name": {
          "type": "keyword",
          "fields": {
            "ngram": {
              "type": "text",
              "analyzer": "autocomplete"
            }
          }
        },
        "country_name": {
          "type": "keyword",
          "fields": {
            "ngram": {
              "type": "text",
              "analyzer": "autocomplete"
            }
          }
        }
      }
    }
  }
}

I give each field both a regular field and an ngram sub-field, so we keep the ability to run aggregations. That is handy, for example, for finding cities that have several airports, as sketched below.
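As a sketch of that, a terms aggregation on the keyword city_name field lists cities that appear in more than one airport document (the aggregation name is made up for illustration):

# Cities that appear in at least two airport documents
GET /flight-location_methods/_search
{
  "size": 0,
  "aggs": {
    "cities_with_several_airports": {
      "terms": {
        "field": "city_name",
        "min_doc_count": 2,
        "size": 20
      }
    }
  }
}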

Now we can run a simple query to get New York:

{
  "size": 20,
  "query": {
    "query_string": {
      "default_field": "city_name.ngram",
      "query": "new yo",
      "default_operator": "AND"
    }
  }
}
Response:
{
  "took": 15,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 13.896059,
    "hits": [
      {
        "_index": "test-index",
        "_type": "_doc",
        "_id": "BtBD2W0BCDulLSY6pKM8",
        "_score": 13.896059,
        "_source": {
          "airport_name": "Flushing",
          "airport_code": "FLU",
          "city_name": "New York",
          "country_name": "United States"
        }
      }
    ]
  }
}

Alternatively, build a boosted or text query that uses boosts. This will also be more efficient for queries over a large list of data.

Your query should look like this:

{
  "query": {
    "function_score": {
      "query": {
        "query_string": {
          "query": "new yo",
          "analyzer": "autocomplete"
        }
      },
      "functions": [
        {
          "filter": {
            "terms": {
              "city_name.ngram": ["new", "yo"]
            }
          },
          "weight": 2
        },
        {
          "filter": {
            "terms": {
              "country_name.ngram": ["new", "yo"]
            }
          },
          "weight": 2
        }
      ],
      "max_boost": 30,
      "min_score": 5,
      "score_mode": "max",
      "boost_mode": "multiply"
    }
  }
}

In this query New York comes first, because the query part already filters out all irrelevant documents. The score is multiplied by 2 for matches on the city_name.ngram field; in this field we have two matching tokens, so this document gets the maximum score. Also, the min_score at the bottom of the query filters out documents that are not relevant enough. You can read more about the current Elasticsearch relevance algorithm in the Elasticsearch documentation. By the way, I would not put filters with the same weight into the functions. You should decide which field is more important; that makes your search more precise.
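As a sketch of that last point, the same query with the city field weighted higher than the country field could look like this (the weights 3 and 2 are only illustrative):

# Illustrative weights: a city match counts for more than a country match
GET /flight-location_methods/_search
{
  "query": {
    "function_score": {
      "query": {
        "query_string": {
          "query": "new yo",
          "analyzer": "autocomplete"
        }
      },
      "functions": [
        {
          "filter": { "terms": { "city_name.ngram": ["new", "yo"] } },
          "weight": 3
        },
        {
          "filter": { "terms": { "country_name.ngram": ["new", "yo"] } },
          "weight": 2
        }
      ],
      "max_boost": 30,
      "min_score": 5,
      "score_mode": "max",
      "boost_mode": "multiply"
    }
  }
}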
