How to prevent Elasticsearch from matching on only a single non-English character with multi_match



I am using elasticsearch-dsl with the SmartCN analyzer:

from elasticsearch_dsl import analyzer, tokenizer

analyzer_cn = analyzer(
    'smartcn',
    tokenizer=tokenizer('smartcn_tokenizer'),
    filter=['lowercase']
)

I use multi_match to match several terms:

from elasticsearch_dsl import Q
q_new = Q("multi_match", query="SOME_QUERY", fields=["FIELD_NAME"])

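The desired behavior is for ES to return only documents in which at least two characters match. I have looked through the documentation but could not find a way to stop Elasticsearch from matching on a single character.

Any pointers or suggestions would be appreciated. Thanks.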

"The desired behavior is for ES to return only documents in which at least two characters match."

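I am not familiar with the SmartCN analyzer, but if you want matches of at least two characters then, depending on your use case, you can use the N-gram tokenizer. It first breaks the text into words whenever it encounters a character that is not in the configured token_chars classes, and then emits N-grams of each word within the specified length range.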
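To make the splitting and N-gram steps concrete, here is a minimal pure-Python sketch of that behavior. It is only an approximation for illustration, not Elasticsearch's implementation: the function name `ngram_tokens` is hypothetical, and "letter or digit" is approximated with a Unicode-aware regular expression.

```python
import re


def ngram_tokens(text, min_gram=2, max_gram=20):
    """Approximate an ngram tokenizer with token_chars = [letter, digit]."""
    tokens = []
    # Step 1: split the text into words, i.e. runs of letters and digits.
    # [^\W_] is "any word character except underscore".
    for word in re.findall(r"[^\W_]+", text):
        # Step 2: for every start offset, emit each N-gram whose length
        # is between min_gram and max_gram and that fits in the word.
        for start in range(len(word)):
            longest = min(max_gram, len(word) - start)
            for length in range(min_gram, longest + 1):
                tokens.append(word[start:start + length])
    return tokens


print(ngram_tokens("world"))
print(ngram_tokens("w"))  # a single character is shorter than min_gram
```

With min_gram set to 2, a one-character word produces no tokens at all, which is why a single character has nothing to match against in the index.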

Index mapping (note that min_gram is set to 2):

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    },
    "max_ngram_diff": 50
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}

Index data:

{
  "title": "world"
}

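Analyze API

A search query for "title": "w" will not match, because the generated tokens have a minimum length of 2 (min_gram is set to 2 in the index mapping above).

The _analyze API generates the following tokens for the text "world" with my_analyzer:

POST /_analyze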
{
  "tokens": [
    {
      "token": "wo",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "wor",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 1
    },
    {
      "token": "worl",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 2
    },
    {
      "token": "world",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 3
    },
    {
      "token": "or",
      "start_offset": 1,
      "end_offset": 3,
      "type": "word",
      "position": 4
    },
    {
      "token": "orl",
      "start_offset": 1,
      "end_offset": 4,
      "type": "word",
      "position": 5
    },
    {
      "token": "orld",
      "start_offset": 1,
      "end_offset": 5,
      "type": "word",
      "position": 6
    },
    {
      "token": "rl",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 7
    },
    {
      "token": "rld",
      "start_offset": 2,
      "end_offset": 5,
      "type": "word",
      "position": 8
    },
    {
      "token": "ld",
      "start_offset": 3,
      "end_offset": 5,
      "type": "word",
      "position": 9
    }
  ]
}

Search query:

{
  "query": {
    "match": {
      "title": "wo"
    }
  }
}

Search result:

"hits": [
  {
    "_index": "stof_64003025",
    "_type": "_doc",
    "_id": "2",
    "_score": 0.56802315,
    "_source": {
      "title": "world"
    }
  }
]
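The reason "wo" hits while "w" does not can be sketched in a few lines of Python. This is a deliberately simplified model rather than how Elasticsearch scores or executes queries: `search_tokens` is a hypothetical, rough stand-in for the standard search analyzer on simple Latin text, and the indexed tokens are copied from the _analyze output above.

```python
import re

# Tokens stored in the index for {"title": "world"},
# copied from the _analyze output above.
INDEXED_TOKENS = {
    "wo", "wor", "worl", "world",
    "or", "orl", "orld",
    "rl", "rld",
    "ld",
}


def search_tokens(query):
    """Rough stand-in for the standard analyzer: lowercase, split into words."""
    return re.findall(r"[^\W_]+", query.lower())


def matches(query):
    """A match query hits when any query token equals an indexed token."""
    return any(token in INDEXED_TOKENS for token in search_tokens(query))


for query in ("w", "wo", "world"):
    print(query, matches(query))
```

Because no one-character token was ever indexed, a one-character query term has nothing to match against, whereas "wo" is itself one of the indexed N-grams.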
