Elasticsearch synonyms completely eliminated by analyzer



I am creating synonyms in Elasticsearch using a synonyms file. My requirement is to show picture frames in different sizes.

For example:

6x9, 6 x 9 => 6x9

But when I close and reopen the index, I get the following error:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "failed to build synonyms"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "failed to build synonyms",
    "caused_by": {
      "type": "parse_exception",
      "reason": "Invalid synonym rule at line 107",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "term: 6 x 9 was completely eliminated by analyzer"
      }
    }
  },
  "status": 400
}

This rule, on the other hand, causes no error:

8x10, 8 x 10 => 8x10

So a rule only works when there are at least two digits after the x, like the 10 in 8 x 10. The plain 6x9 works fine; the only problem is 6 x 9, because it contains spaces and the last number is a single digit. If I change it to 6 x 09, it works fine.

Settings:

"analysis": {
"filter": {
"synonym_filter": {
"type": "synonym",
"synonyms_path": "analysis/synonyms.txt"
},
"suggestions_shingle": {
"max_shingle_size": "4",
"min_shingle_size": "2",
"type": "shingle"
},
"english_stemmer_filter": {
"name": "minimal_english",
"type": "stemmer"
},
"edgeNGram_filter": {
"min_gram": "2",
"side": "front",
"type": "edgeNGram",
"max_gram": "20"
}
},
"analyzer": {
"whitespace_punc_analyzer": {
"filter": [
"lowercase",
"asciifolding",
"word_delimiter"
],
"type": "custom",
"tokenizer": "whitespace"
},
"edge_nGram_analyzer": {
"filter": [
"lowercase",
"asciifolding",
"synonym_filter"
],
"type": "custom",
"tokenizer": "edge_ngram_tokenizer"
},
"path_analyzer_lc": {
"filter": [
"lowercase"
],
"tokenizer": "path_tokenizer"
},
"stemmer_synonym_analyzer": {
"filter": [
"synonym_filter",
"lowercase",
"english_stemmer_filter"
],
"tokenizer": "standard"
},
"whitespace_analyzer": {
"filter": [
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "whitespace"
},
"synonym_analyzer": {
"filter": [
"synonym_filter",
"lowercase",
"edgeNGram_filter"
],
"tokenizer": "standard"
},
"edge_nGram_shingle_analyzer": {
"filter": [
"lowercase",
"asciifolding",
"synonym_filter",
"suggestions_shingle"
],
"type": "custom",
"tokenizer": "edge_ngram_tokenizer"
},
"path_analyzer": {
"tokenizer": "path_tokenizer"
}
},
"tokenizer": {
"edge_ngram_tokenizer": {
"token_chars": [
"letter",
"digit"
],
"min_gram": "2",
"type": "edgeNGram",
"max_gram": "6"
},
"path_tokenizer": {
"ignore_case": "true",
"type": "path_hierarchy",
"delimiter": ">"
}
}}

Thanks in advance!

This is because the edge_ngram_tokenizer has min_gram set to 2, so it cannot produce any token for single-character input. Synonym rules are parsed with the same analysis chain the synonym filter belongs to, so when the index is opened, each term of a rule is first run through that tokenizer.

POST _analyze
{
  "text": "6 x 9",
  "tokenizer": {
    "token_chars": [
      "letter",
      "digit"
    ],
    "min_gram": "2",
    "type": "edgeNGram",
    "max_gram": "6"
  }
}

=> tokens: []

For 8 x 10, only the token 10 is produced, which is probably not what you want either:

POST _analyze
{
  "text": "8 x 10",
  "tokenizer": {
    "token_chars": [
      "letter",
      "digit"
    ],
    "min_gram": "2",
    "type": "edgeNGram",
    "max_gram": "6"
  }
}

=> tokens: [10]
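
This also explains the workaround you found: padding the single digit to 09 reaches the two-character minimum, so one token survives. A quick check with the same tokenizer settings should show this:

POST _analyze
{
  "text": "6 x 09",
  "tokenizer": {
    "token_chars": [
      "letter",
      "digit"
    ],
    "min_gram": "2",
    "type": "edgeNGram",
    "max_gram": "6"
  }
}

=> tokens: [09]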

So the reason you get this error message is that the tokenizer produces no tokens at all for 6 x 9, leaving the synonym token filter with nothing to work with.
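
Two possible ways out, sketched against the settings above (adjust to your needs): lower min_gram on edge_ngram_tokenizer to 1 so that single digits such as 6 and 9 still produce tokens, and/or mark the synonym filter as lenient (supported since Elasticsearch 6.1) so that rules which cannot be parsed are skipped instead of failing the whole index open:

"filter": {
  "synonym_filter": {
    "type": "synonym",
    "synonyms_path": "analysis/synonyms.txt",
    "lenient": true
  }
},
"tokenizer": {
  "edge_ngram_tokenizer": {
    "token_chars": [
      "letter",
      "digit"
    ],
    "min_gram": "1",
    "type": "edgeNGram",
    "max_gram": "6"
  }
}

Keep in mind the trade-offs: lenient simply drops the offending rule, so 6 x 9 would then not act as a synonym at all, while min_gram: 1 makes the rule parse but also emits single-character grams for everything, which grows the index. If neither is acceptable, applying the synonyms on an analyzer whose tokenizer keeps whole words (such as standard, as in your stemmer_synonym_analyzer) is the cleaner option.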
