aws-opensearch: Why are identical records ranked differently?



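I have set up an AWS OpenSearch instance with almost everything left at its default value. I then inserted some data about hotels. When a user searches for something like Good Morning B, the query POST request I send looks like this: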

{
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "good morning b*",
"fields": ["name"],
"default_operator": "and"
}
},
{
"match": {
"provider": "SomeProvider"
}
}
]
}
},
"sort": {
"_score": {
"order": "desc"
},
"name.keyword": {
"order": "asc"
}
}
}

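The result contains 4 entries for 2 different hotels. Apart from the ID, the name and all other data in the index are identical. Here is an excerpt of the response: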

{
"took": 442,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4,
"relation": "eq"
},
"max_score": null,
"hits": [
{
"_index": "hotels",
"_type": "_doc",
"_id": "1",
"_score": 11.143229,
"_source": {
"id": "1",
"name": "Good Morning + Berlin City East",
"provider": "SomeProvider"
},
"sort": [
11.143229,
"Good Morning + Berlin City East"
]
},
{
"_index": "hotels",
"_type": "_doc",
"_id": "2",
"_score": 10.455675,
"_source": {
"id": "2",
"name": "Good Morning Bad Oldesloe",
"provider": "SomeProvider"
},
"sort": [
10.455675,
"Good Morning Bad Oldesloe"
]
},
{
"_index": "hotels",
"_type": "_doc",
"_id": "3",
"_score": 10.455675,
"_source": {
"id": "3",
"name": "Good Morning Bad Oldesloe",
"provider": "SomeProvider"
},
"sort": [
10.455675,
"Good Morning Bad Oldesloe"
]
},
{
"_index": "hotels",
"_type": "_doc",
"_id": "4",
"_score": 9.6945305,
"_source": {
"id": "4",
"name": "Good Morning + Berlin City East",
"provider": "SomeProvider"
},
"sort": [
9.6945305,
"Good Morning + Berlin City East"
]
}
]
}
}

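As you can see, the two entries for "Good Morning + Berlin City East" are ranked differently. As I said, the data they contain is exactly the same. Since the names are identical, I would have expected them to get the same score, as the two "Good Morning Bad Oldesloe" entries do.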

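I ran the same query with the explain=true parameter and got the following for the Berlin entries (I am only posting the relevant parts here to keep it reasonably compact):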

// ID = 1
{
"sort": [
11.143229,
"Good Morning + Berlin City East"
],
"_explanation": {
"value": 11.143229,
"description": "sum of:",
"details": [
{
"value": 9.302926,
"description": "sum of:",
"details": [
{
"value": 4.151463,
"description": "weight(name:good in 1) [PerFieldSimilarity], result of:",
"details": [
{
"value": 4.151463,
"description": "score(freq=1.0), computed as boost * idf * tf from:",
"details": [
{
"value": 2.2,
"description": "boost",
"details": []
},
{
"value": 4.811831,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 11,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 1413,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 0.3921644,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 1.0,
"description": "freq, occurrences of term within document",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 5.0,
"description": "dl, length of field",
"details": []
},
{
"value": 3.6001415,
"description": "avgdl, average length of field",
"details": []
}
]
}
]
}
]
},
{
"value": 4.151463,
"description": "weight(name:morning in 1) [PerFieldSimilarity], result of:",
"details": [
{
"value": 4.151463,
"description": "score(freq=1.0), computed as boost * idf * tf from:",
"details": [
{
"value": 2.2,
"description": "boost",
"details": []
},
{
"value": 4.811831,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 11,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 1413,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 0.3921644,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 1.0,
"description": "freq, occurrences of term within document",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 5.0,
"description": "dl, length of field",
"details": []
},
{
"value": 3.6001415,
"description": "avgdl, average length of field",
"details": []
}
]
}
]
}
]
},
{
"value": 1.0,
"description": "name:b*",
"details": []
}
]
},
{
"value": 1.840302,
"description": "weight(provider:hob in 1) [PerFieldSimilarity], result of:",
"details": [
{
"value": 1.840302,
"description": "score(freq=1.0), computed as boost * idf * tf from:",
"details": [
{
"value": 2.2,
"description": "boost",
"details": []
},
{
"value": 1.8403021,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 224,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 1413,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 0.45454544,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 1.0,
"description": "freq, occurrences of term within document",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 1.0,
"description": "dl, length of field",
"details": []
},
{
"value": 1.0,
"description": "avgdl, average length of field",
"details": []
}
]
}
]
}
]
}
]
}
}
// ID = 2
{
"sort": [
9.6945305,
"Good Morning + Berlin City East"
],
"_explanation": {
"value": 9.6945305,
"description": "sum of:",
"details": [
{
"value": 7.975009,
"description": "sum of:",
"details": [
{
"value": 3.4875045,
"description": "weight(name:good in 380) [PerFieldSimilarity], result of:",
"details": [
{
"value": 3.4875045,
"description": "score(freq=1.0), computed as boost * idf * tf from:",
"details": [
{
"value": 2.2,
"description": "boost",
"details": []
},
{
"value": 4.0562115,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 24,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 1414,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 0.39081526,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 1.0,
"description": "freq, occurrences of term within document",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 5.0,
"description": "dl, length of field",
"details": []
},
{
"value": 3.5749645,
"description": "avgdl, average length of field",
"details": []
}
]
}
]
}
]
},
{
"value": 3.4875045,
"description": "weight(name:morning in 380) [PerFieldSimilarity], result of:",
"details": [
{
"value": 3.4875045,
"description": "score(freq=1.0), computed as boost * idf * tf from:",
"details": [
{
"value": 2.2,
"description": "boost",
"details": []
},
{
"value": 4.0562115,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 24,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 1414,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 0.39081526,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 1.0,
"description": "freq, occurrences of term within document",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 5.0,
"description": "dl, length of field",
"details": []
},
{
"value": 3.5749645,
"description": "avgdl, average length of field",
"details": []
}
]
}
]
}
]
},
{
"value": 1.0,
"description": "name:b*",
"details": []
}
]
},
{
"value": 1.719521,
"description": "weight(provider:hob in 380) [PerFieldSimilarity], result of:",
"details": [
{
"value": 1.719521,
"description": "score(freq=1.0), computed as boost * idf * tf from:",
"details": [
{
"value": 2.2,
"description": "boost",
"details": []
},
{
"value": 1.719521,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 253,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 1414,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 0.45454544,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 1.0,
"description": "freq, occurrences of term within document",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 1.0,
"description": "dl, length of field",
"details": []
},
{
"value": 1.0,
"description": "avgdl, average length of field",
"details": []
}
]
}
]
}
]
}
]
}
}

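The main difference, and apparently the reason for the different ranking, is "n, number of documents containing term", which is 11 for the higher-ranked entry (id=1) and 24 for the lower-ranked entry (id=2). But since every data field is identical (except the id), shouldn't this number be the same? The search terms are the same for both entries.

Can someone explain to me (in simple terms, without too much math) why there is a difference for this hotel but not for the Bad Oldesloe one (where, as one would expect, the numbers in the explanation are the same)?

Thanks in advance.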

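The document counts are not computed by Elasticsearch itself but by the underlying Lucene engine, and they are computed per shard (each shard is a complete Lucene index). Your explain output shows this: the two Berlin entries report different values for N (1413 vs. 1414) and n (11 vs. 24), so they were scored against the statistics of two different shards. Since your documents are (probably) in different shards, their scores differ slightly. The two Bad Oldesloe entries presumably ended up in the same shard, which is why their numbers match.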
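
To make the effect concrete, here is a minimal Python sketch that recomputes the score of the term "good" from the numbers in the two explain outputs above. The formulas are the ones printed in the explain descriptions; the function names and the `shard_a` / `shard_b` labels are my own and purely illustrative.

```python
import math

def bm25_idf(n, N):
    # idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    # tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

def term_score(n, N, dl, avgdl, freq=1.0, boost=2.2):
    # score, computed as boost * idf * tf
    return boost * bm25_idf(n, N) * bm25_tf(freq, dl, avgdl)

# Per-shard statistics copied from the two explain outputs.
# "shard_a" / "shard_b" are hypothetical labels, not real shard names.
shard_a = {"n": 11, "N": 1413, "dl": 5.0, "avgdl": 3.6001415}
shard_b = {"n": 24, "N": 1414, "dl": 5.0, "avgdl": 3.5749645}

score_a = term_score(**shard_a)
score_b = term_score(**shard_b)

print(f"name:good on shard_a: {score_a:.4f}")
print(f"name:good on shard_b: {score_b:.4f}")
```

The document itself is the same in both cases (freq = 1, dl = 5); only the shard-level statistics n, N and avgdl change, and that alone accounts for the gap between the two scores. If you need scores that are comparable across shards, OpenSearch accepts `search_type=dfs_query_then_fetch` on the search request, which collects term statistics from all shards before scoring, at the cost of an extra round trip.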
