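Different position values from the Analyze API in ES 1.4 and ES 2.3

I'm upgrading from ES 1.4 to ES 2.3, and while testing document scoring I noticed a difference in the Explain API output for the same query: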

{
    "query": {
        "match": {
            "article_content": "news"
        }
    }
}

From ES 2.3.0 I get:

{
  "value": 0.9890914,
  "description": "fieldWeight in 3931, product of:",
  "details": [
    {
      "value": 5.8309517,
      "description": "tf(freq=34.0), with freq of:",
      "details": [
        {
          "value": 34,
          "description": "termFreq=34.0",
          "details": []
        }
      ]
    },
    {
      "value": 5.428089,
      "description": "idf(docFreq=117, maxDocs=9885)",
      "details": []
    },
    {
      "value": 0.03125,
      "description": "fieldNorm(doc=3931)",
      "details": []
    }
  ]
}

From ES 1.4.2 I get:

{
  "value": 0.9319723,
  "description": "fieldWeight in 403, product of:",
  "details": [
    {
      "value": 5.8309517,
      "description": "tf(freq=34.0), with freq of:",
      "details": [
        {
          "value": 34,
          "description": "termFreq=34.0"
        }
      ]
    },
    {
      "value": 5.114622,
      "description": "idf(docFreq=226, maxDocs=13899)"
    },
    {
      "value": 0.03125,
      "description": "fieldNorm(doc=403)"
    }
  ]
}

I thought something might be wrong with my custom analyzer, so I checked it with the Analyze API.

For ES 2.3 I used:

curl -XGET 'localhost:9200/new_index/_analyze' -d '{
  "analyzer" : "custom_text_analyzer",
  "text" : "...."
}'

For ES 1.4.2 I used:

curl -XGET 'localhost:9210/new_index2/_analyze?analyzer=custom_text_analyzer' -d '...'

Both calls produced the same number of tokens; the only difference is the value of "position".

For ES 2.3.0:

{
  "tokens": [
    {
      "token": "show",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    }, ....

For ES 1.4.2:

{
  "tokens": [
    {
      "token": "show",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 1
    }, ....

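Notes:

Both indices contain the same documents, in number and in content.
The document I tested has 289 tokens.
The custom analyzer is identical in both indices (I checked twice).

I'd just like to understand what the problem might be.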

The difference in the score comes from here:

2.3.0:

{
  "value": 5.428089,
  "description": "idf(docFreq=117, maxDocs=9885)",
  "details": []
}

1.4.2:

{
  "value": 5.114622,
  "description": "idf(docFreq=226, maxDocs=13899)"
}

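So the IDF differs: the two indices report a different document count (maxDocs), and the number of documents containing the term (docFreq) differs as well. You say you have the same number of documents, but maxDocs counts every document in the Lucene shard, including those marked as deleted.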
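As a rough check, both Explain outputs can be reproduced by hand. This is a minimal sketch that assumes both clusters use Lucene's classic TF/IDF similarity, where tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)). The function names are illustrative and not part of any Elasticsearch API.

```python
import math


def tf(freq):
    # Classic Lucene TF/IDF: term frequency factor is the square root of freq.
    return math.sqrt(freq)


def idf(doc_freq, max_docs):
    # Classic Lucene TF/IDF: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq, doc_freq, max_docs, field_norm):
    # fieldWeight is the product of tf, idf and fieldNorm, as in the Explain output.
    return tf(freq) * idf(doc_freq, max_docs) * field_norm


if __name__ == "__main__":
    # Inputs taken from the two Explain outputs in the question.
    print("2.3.0:", idf(117, 9885), field_weight(34, 117, 9885, 0.03125))
    print("1.4.2:", idf(226, 13899), field_weight(34, 226, 13899, 0.03125))
```

Since tf and fieldNorm are identical in both outputs, the whole score gap comes from the idf term, that is, from docFreq and maxDocs.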

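My assumption is that your 1.4.x index also contains some deleted documents (ones that have not yet been physically removed from disk by a segment merge), and that these affect the score calculation. You can check the number of deleted documents with curl -XGET "http://localhost:9200/_cat/indices?v", or force a merge with the _optimize API (curl -XPOST "http://localhost:9200/my_index/_optimize?max_num_segments=1"). Note, however, that optimizing does consume resources, so you will want to run it when the cluster is not busy.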
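If you want to compare the two clusters side by side, the _cat/indices?v output can be parsed by column name. This is only a sketch: the SAMPLE text below is made up for illustration (it is not output from either cluster in the question), and the helper name is hypothetical. It relies only on the docs.count and docs.deleted columns being present in the header.

```python
# Hypothetical sample of `_cat/indices?v` output; the numbers are invented.
SAMPLE = """\
health status index    pri rep docs.count docs.deleted store.size pri.store.size
green  open   my_index   5   1       1000           50     12.3mb          6.1mb
"""


def deleted_doc_counts(cat_output):
    """Return {index_name: (live_docs, deleted_docs)} from `_cat/indices?v` text."""
    lines = [ln for ln in cat_output.splitlines() if ln.strip()]
    header = lines[0].split()
    i_index = header.index("index")
    i_count = header.index("docs.count")
    i_deleted = header.index("docs.deleted")
    counts = {}
    for row in lines[1:]:
        cols = row.split()
        # Skip rows that do not line up with the header (for example closed indices).
        if len(cols) != len(header):
            continue
        counts[cols[i_index]] = (int(cols[i_count]), int(cols[i_deleted]))
    return counts


if __name__ == "__main__":
    for name, (live, deleted) in deleted_doc_counts(SAMPLE).items():
        print(name, "live:", live, "deleted:", deleted)
```

A non-zero deleted count on the 1.4.x index but not on the 2.3 index would be consistent with the explanation above. Keep in mind that these are index-wide totals, while the maxDocs in the Explain output is reported for a single shard, so treat the comparison as a rough indicator.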
