Custom analyzer, use case: zip code [ElasticSearch]

Suppose there is an index/type named customers/customer. Each document in it has a zip code as a property. Basically, the zip code can look like:

A string (e.g. 8907-1009)
A string (e.g. 211-20)
A string (e.g. 30200)

I want to set up my index analyzer so that a search matches as many documents as possible. Currently, I work like this:

PUT /customers/
{
  "mappings": {
    "customer": {
      "properties": {
        "zip-code": {
          "type": "string",
          "index": "not_analyzed"
        }
        some string properties ...
      }
    }
  }
}

When I search for a document, I use this request:

GET /customers/customer/_search
{
  "query": {
    "prefix": {
      "zip-code": "211-20"
    }
  }
}

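That works if you want a strict search. But if, for instance, the zip code is "200 30", then searching for "200-30" gives no results. To avoid this problem, I would like to configure an analyzer on my index. Can anyone help me? Thanks.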

P.S. If you need more information, just let me know ;)

As soon as you want to find variations, you don't want to use not_analyzed.

Let's try this with a different mapping:

PUT zip
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "zip_code": {
          "tokenizer": "standard",
          "filter": [ ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "zip": {
          "type": "text",
          "analyzer": "zip_code"
        }
      }
    }
  }
}

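We're using the standard tokenizer; strings are broken up into tokens at whitespace and punctuation marks (including dashes). You can see the actual tokens if you run the following query: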

POST zip/_analyze
{
  "analyzer": "zip_code",
  "text": ["8907-1009", "211-20", "30200"]
}
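
To make the split concrete without a running cluster, here is a minimal Python sketch that approximates what the standard tokenizer does to these values. The regular expression and the function name `standard_like_tokens` are assumptions for illustration only; the real tokenizer follows Unicode text segmentation rules and handles many more cases.

```python
import re


def standard_like_tokens(text):
    """Rough stand-in for the standard tokenizer (an approximation, not the
    real implementation): keep runs of ASCII letters and digits, and drop
    whitespace and punctuation such as dashes."""
    return re.findall(r"[0-9A-Za-z]+", text)


for value in ["8907-1009", "211-20", "30200", "200 30"]:
    print(value, "->", standard_like_tokens(value))
```

Since "200 30" and "200-30" come out as the same two tokens, a match query for one can find the other, which is the case raised in the question.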

Add your examples:

POST zip/_doc
{
  "zip": "8907-1009"
}
POST zip/_doc
{
  "zip": "211-20"
}
POST zip/_doc
{
  "zip": "30200"
}

Now the query seems to work fine:

GET zip/_search
{
  "query": {
    "match": {
      "zip": "211-20"
    }
  }
}

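This will also work if you just search for "211". However, this might be too lenient, since a match query for "211-20" will also find zip codes such as "20", "20-211", "211-10", ...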
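
The reason for this leniency is that a match query, by default, needs only one of the query's tokens to be present in the field. The following Python sketch imitates that OR behavior; the regex is a rough stand-in for the analyzer, and the list of zip codes is made up for illustration.

```python
import re


def tokens(text):
    """Approximate the standard tokenizer: runs of ASCII letters and digits."""
    return re.findall(r"[0-9A-Za-z]+", text)


def match_any(query, field_value):
    """True if the field shares at least one token with the query (OR semantics)."""
    return bool(set(tokens(query)) & set(tokens(field_value)))


# Hypothetical field values, not documents from the thread's index.
zips = ["211-20", "20", "20-211", "211-10", "30200"]
hits = [z for z in zips if match_any("211-20", z)]
print(hits)  # ['211-20', '20', '20-211', '211-10']
```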

What you probably want is a phrase search, where all the tokens of your query need to be in the field and also in the right order:

GET zip/_search
{
  "query": {
    "match_phrase": {
      "zip": "211"
    }
  }
}
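
As a rough illustration of the "all tokens, in order" rule, the Python sketch below checks whether the query tokens appear as one consecutive run in the field tokens. It assumes the default phrase behavior with no slop, and the regex tokenizer is only an approximation of the standard tokenizer.

```python
import re


def tokens(text):
    """Approximate the standard tokenizer: runs of ASCII letters and digits."""
    return re.findall(r"[0-9A-Za-z]+", text)


def match_phrase(query, field_value):
    """True if the query tokens occur consecutively and in order in the field."""
    q, f = tokens(query), tokens(field_value)
    return any(f[i:i + len(q)] == q for i in range(len(f) - len(q) + 1))


for value in ["211-20", "20-211", "211-10", "20"]:
    print(value, match_phrase("211-20", value))
```

Only "211-20" passes here. Note that a one-token phrase such as "211" still matches "211-10"; the ordering constraint only matters for queries with several tokens.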

Addition:

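If the zip codes have a hierarchical meaning (if you have "211-20", you want it to be found when searching for "211", but not when searching for "20"), you can use the path_hierarchy tokenizer.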
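
To show what that tokenizer emits, here is a small Python sketch of the idea: every prefix of the value, cut at the delimiter, becomes a token. The function name is hypothetical and the sketch ignores the tokenizer's other options and edge cases such as leading delimiters.

```python
def path_hierarchy_tokens(text, delimiter="-"):
    """Emit each delimiter-bounded prefix of the value, shortest first."""
    parts = text.split(delimiter)
    return [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]


def match_any(query, field_value):
    """OR-style match, with the same tokenizer applied to query and field."""
    return bool(set(path_hierarchy_tokens(query)) & set(path_hierarchy_tokens(field_value)))


print(path_hierarchy_tokens("211-20"))  # ['211', '211-20']
print(match_any("211", "211-20"))       # True
print(match_any("20", "211-20"))        # False
```

The second part, "20", never becomes a token of its own, so a search for it cannot match "211-20".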

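So change the mapping to the following. The analyzer of an existing field cannot be changed in place, so delete the index (DELETE zip) and create it again: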

PUT zip
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "zip_code": {
          "tokenizer": "zip_tokenizer",
          "filter": [ ]
        }
      },
      "tokenizer": {
        "zip_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "zip": {
          "type": "text",
          "analyzer": "zip_code"
        }
      }
    }
  }
}

With the 3 documents from above indexed again, you can now use the match query:

GET zip/_search
{
  "query": {
    "match": {
      "zip": "1009"
    }
  }
}

"1009" won't find anything, but "8907" or "8907-1009" will.

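If you also want to find "1009", but with a lower score, you'll have to analyze the zip code with both variations I have shown (combine the 2 versions of the mapping):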

PUT zip
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "zip_hierarchical": {
          "tokenizer": "zip_tokenizer",
          "filter": [ ]
        },
        "zip_standard": {
          "tokenizer": "standard",
          "filter": [ ]
        }
      },
      "tokenizer": {
        "zip_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "zip": {
          "type": "text",
          "analyzer": "zip_standard",
          "fields": {
            "hierarchical": {
              "type": "text",
              "analyzer": "zip_hierarchical"
            }
          }
        }
      }
    }
  }
}

Add a document with the inverse order to test it properly:

POST zip/_doc
{
  "zip": "1009-111"
}

Then search both fields, but boost the one with the hierarchical tokenizer by 3:

GET zip/_search
{
  "query": {
    "multi_match": {
      "query": "1009",
      "fields": [ "zip", "zip.hierarchical^3" ]
    }
  }
}

Then you can see that "1009-111" has a much higher score than "8907-1009".
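
The gap in scores comes from which fields each document matches. The Python sketch below does not compute any Elasticsearch scores; under the simplifying assumption that both analyzers can be approximated by the two small functions shown, it only lists the fields in which a search for "1009" finds a token.

```python
import re


def standard_like(text):
    """Approximation of the standard tokenizer used by the zip field."""
    return re.findall(r"[0-9A-Za-z]+", text)


def hierarchy(text, delimiter="-"):
    """Approximation of the path_hierarchy tokenizer used by zip.hierarchical."""
    parts = text.split(delimiter)
    return [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]


def matching_fields(query, value):
    """Names of the fields in which at least one query token is found."""
    fields = []
    if set(standard_like(query)) & set(standard_like(value)):
        fields.append("zip")
    if set(hierarchy(query)) & set(hierarchy(value)):
        fields.append("zip.hierarchical")
    return fields


for value in ["8907-1009", "1009-111", "211-20"]:
    print(value, matching_fields("1009", value))
```

"8907-1009" matches only through zip, while "1009-111" also matches through zip.hierarchical, the field that carries the boost of 3 in the query.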
