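In my Elasticsearch results, a document is returned even when only a single character matches. Results that match on nothing but a single digit look odd.
Is there a way to filter out results that match only a single digit or character using the query DSL?
Current query: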
GET /attachment_index/_search
{
  "_source": [
    "user_email_id",
    "file_content_id",
    "file_name",
    "non_indexed_meta_data"
  ],
  "query": {
    "bool": {
      "must": [
        {
          "has_child": {
            "type": "user_email_id",
            "query": {
              "match": {
                "user_email_id": "test@user.com"
              }
            },
            "inner_hits": {}
          }
        },
        {
          "match": {
            "attachment.content": {
              "query": "mark twain 3",
              "analyzer": "english",
              "operator": "or"
            }
          }
        }
      ]
    }
  },
  "highlight": {
    "order": "score",
    "pre_tags": [
      "<strong>"
    ],
    "post_tags": [
      "</strong>"
    ],
    "fields": {
      "attachment.content": {}
    }
  },
  "size": 100
}
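It returns documents that match only on 3, which I don't want. Any ideas on how to filter terms by length without preprocessing the input before it is sent to Elasticsearch?
You can filter by length with a custom analyzer. The Elasticsearch documentation includes an example of how to rebuild the english analyzer, so we can add a minimum-length filter to it, for example: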
PUT /attachment_index
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        },
        "length": {
          "type": "length",
          "min": 2
        }
      },
      "analyzer": {
        "length_english": {
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_stemmer",
            "length"
          ]
        }
      }
    }
  }
}
Trying it out:
GET attachment_index/_analyze
{
  "analyzer": "length_english",
  "text": "mark twain 3"
}
returns
{
  "tokens" : [
    {
      "token" : "mark",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "twain",
      "start_offset" : 5,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
So the token 3 is filtered out as required. In the match query, you can then use the length_english analyzer in place of english.
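As a minimal sketch of that last step, assuming the index was created with the length_english analyzer defined above, the second clause of the must array in the original query might look like this (only the analyzer value changes; everything else is copied from the question):

```json
{
  "match": {
    "attachment.content": {
      "query": "mark twain 3",
      "analyzer": "length_english",
      "operator": "or"
    }
  }
}
```

With this analyzer applied at search time, the query text is reduced to the terms mark and twain, so a document that contains only 3 should no longer match through this clause. Note that if a query consists solely of single-character terms, the analyzer removes every token, and the match query's default zero_terms_query setting of none means no documents are returned for that clause.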