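I have the following synonym expansion:
suco => suco, refresco, bebida de soja
What I want is for searches to be tokenized this way:
A search for "suco de laranja" should be tokenized as ["suco", "laranja", "refresco", "bebida de soja"].
But what I get is ["suco", "laranja", "refresco", "bebida", "soja"].
Consider that "de" is a stop word. I want it to be ignored in queries, so that "bebida de laranja" becomes ["bebida", "laranja"]. But I don't want it to be treated as a stop word during synonym tokenization, so that "bebida de soja" stays a single token "bebida de soja".
My settings: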
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym_br": {
          "type": "synonym",
          "synonyms": [
            "suco => suco, refresco, bebida de soja"
          ]
        },
        "brazilian_stop": {
          "type": "stop",
          "stopwords": "_brazilian_"
        }
      },
      "analyzer": {
        "synonyms": {
          "filter": [
            "synonym_br",
            "lowercase",
            "brazilian_stop",
            "asciifolding"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}
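I suggest two changes. The first directly addresses the problem you asked about; the second is a recommendation.
- Instead of expanding one term into several synonyms, do the reverse: make all the synonyms point to a single term. So change
"suco => suco, refresco, bebida de soja"
to "suco, refresco, bebida de soja => suco"
- Change the order of the filters in the synonyms analyzer. Put lowercase before synonym_br. This ensures that letter case does not affect the synonym_br token filter.
So the final settings will be: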
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym_br": {
          "type": "synonym",
          "synonyms": [
            "suco, refresco, bebida de soja => suco"
          ]
        },
        "brazilian_stop": {
          "type": "stop",
          "stopwords": "_brazilian_"
        }
      },
      "analyzer": {
        "synonyms": {
          "filter": [
            "lowercase",
            "synonym_br",
            "brazilian_stop",
            "asciifolding"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}
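To check the analyzer yourself, you can send sample text to the _analyze API of an index created with these settings. The sketch below only builds the request bodies and prints them; it does not contact a cluster. The index name "products" and the helper analyze_request are assumptions made for illustration and are not part of the original setup.

```python
import json

INDEX = "products"  # hypothetical index name, not from the original post

# Body for "PUT /products": the final settings shown above.
settings_body = {
    "settings": {
        "analysis": {
            "filter": {
                "synonym_br": {
                    "type": "synonym",
                    "synonyms": ["suco, refresco, bebida de soja => suco"],
                },
                "brazilian_stop": {"type": "stop", "stopwords": "_brazilian_"},
            },
            "analyzer": {
                "synonyms": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "synonym_br",
                        "brazilian_stop",
                        "asciifolding",
                    ],
                }
            },
        }
    }
}


def analyze_request(text, analyzer="synonyms"):
    """Return (method, path, body) for an _analyze call; nothing is sent."""
    return "GET", f"/{INDEX}/_analyze", {"analyzer": analyzer, "text": text}


if __name__ == "__main__":
    print("PUT", f"/{INDEX}")
    print(json.dumps(settings_body, indent=2))
    for sample in ("bebida de soja", "de soja", "suco de laranja"):
        method, path, body = analyze_request(sample)
        print(method, path)
        print(json.dumps(body))
```

Pasting the printed bodies into any HTTP client or console should return the token list the analyzer produces for each sample text.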
How does this work?
For the input bebida de soja, the filters are applied in the following order:
Filter            Result tokens
====================================
lowercase         bebida, de, soja
synonym_br        suco    <------- all of the tokens above (including their positions) exactly match a synonym
brazilian_stop    suco
asciifolding      suco
Now let's look at brazilian_stop. For that we need an input that does not match a synonym but does contain de, for example de soja:
Filter            Result tokens
=================================
lowercase         de, soja
synonym_br        de, soja    <------- none of the tokens (individually or combined, including position) matches any synonym
brazilian_stop    soja        <------- de is removed because it is a stop word
asciifolding      soja
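The two traces above can be reproduced with a small toy model of the filter chain. This is only a sketch of the idea, not how Elasticsearch implements these filters: whitespace splitting stands in for the standard tokenizer, the stop word set is a one-word assumption (only "de" is taken from the text), and accent stripping only approximates asciifolding.

```python
import unicodedata

# Assumed subset of the _brazilian_ stop word list; only "de" comes from the text.
STOPWORDS = {"de"}

# Contraction rule: every phrase on the left is replaced by the single token "suco".
SYNONYMS = {
    ("suco",): "suco",
    ("refresco",): "suco",
    ("bebida", "de", "soja"): "suco",
}


def lowercase(tokens):
    return [t.lower() for t in tokens]


def synonym_br(tokens):
    """Replace the longest matching phrase at each position with its target."""
    out, i = [], 0
    max_len = max(len(phrase) for phrase in SYNONYMS)
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in SYNONYMS:
                out.append(SYNONYMS[phrase])
                i += n
                break
        else:  # no phrase starting at position i matched
            out.append(tokens[i])
            i += 1
    return out


def brazilian_stop(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def asciifolding(tokens):
    """Approximate ASCII folding by dropping combining accent marks."""
    def fold(token):
        decomposed = unicodedata.normalize("NFKD", token)
        return "".join(c for c in decomposed if not unicodedata.combining(c))
    return [fold(t) for t in tokens]


DEFAULT_CHAIN = (lowercase, synonym_br, brazilian_stop, asciifolding)


def analyze(text, filters=DEFAULT_CHAIN):
    tokens = text.split()  # stand-in for the standard tokenizer
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


if __name__ == "__main__":
    for sample in ("bebida de soja", "de soja", "bebida de laranja"):
        print(sample, "->", analyze(sample))
```

Passing a different order, for example filters=(synonym_br, lowercase, brazilian_stop, asciifolding), lets you see in the same model why lowercase is placed first: a capitalized input such as Bebida de Soja no longer matches the lowercase rule, so the stop filter then removes de.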