Elasticsearch word_delimiter_graph: split tokens only on specific delimiters



I want an Elasticsearch token filter that behaves like word_delimiter_graph but splits tokens only on specific delimiters (if I'm not mistaken, the default word_delimiter_graph doesn't accept a custom list of delimiters).

For example, I want to split tokens only on the - delimiter:

i-pod -> [i-pod, i, pod]

i_pod -> [i_pod] (because I only want to split on -, not on any other character)

How can I achieve this?

Thanks!

I solved it with the type_table parameter. From the docs:

(Optional, array of strings) Array of custom type mappings for characters. This allows you to map non-alphanumeric characters as numeric or alphanumeric to avoid splitting on those characters.
For example, the following array maps the plus (+) and hyphen (-) characters as alphanumeric, which means they won’t be treated as delimiters.
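
One caveat worth spelling out: type_table works by the inverse logic of the question. Characters you map to ALPHA stop being treated as delimiters, while every character you don't map keeps its default type, so this approach really means "don't split on _" rather than "split only on -". If your input can contain other punctuation, each such character needs its own mapping. A hedged sketch (the + and . mappings and the sample text are illustrative additions, not something from my original test):

GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "word_delimiter_graph",
      "preserve_original": true,
      "type_table": [ "_ => ALPHA", "+ => ALPHA", ". => ALPHA" ]
    }
  ],
  "text": "i_pod+case-cover"
}

With those mappings the filter should split only at the hyphen, keeping i_pod+case together as one subword.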

Test:

i-pad

GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "word_delimiter_graph",
      "preserve_original": true,
      "type_table": [ "_ => ALPHA" ]
    }
  ],
  "text": "i-pad"
}

Tokens:

{
  "tokens": [
    {
      "token": "i-pad",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0,
      "positionLength": 2
    },
    {
      "token": "i",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "pad",
      "start_offset": 2,
      "end_offset": 5,
      "type": "word",
      "position": 1
    }
  ]
}

i_pad

GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "word_delimiter_graph",
      "preserve_original": true,
      "type_table": [ "_ => ALPHA" ]
    }
  ],
  "text": "i_pad"
}

Tokens:

{
  "tokens": [
    {
      "token": "i_pad",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    }
  ]
}
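
The _analyze calls above are one-off tests; to use this behavior at index or search time, register the filter inside a custom analyzer. A minimal sketch, assuming a hypothetical index name my-index and hypothetical filter/analyzer names (none of these names come from the original post); note the Elasticsearch docs recommend pairing word_delimiter_graph with a tokenizer that doesn't strip punctuation, such as keyword:

PUT /my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "split_on_hyphen": {
          "type": "word_delimiter_graph",
          "preserve_original": true,
          "type_table": [ "_ => ALPHA" ]
        }
      },
      "analyzer": {
        "hyphen_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "split_on_hyphen" ]
        }
      }
    }
  }
}

You can then verify the analyzer with GET /my-index/_analyze, passing "analyzer": "hyphen_analyzer" in the request body.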
