I'd like to use an Elasticsearch token filter that behaves like word_delimiter_graph, but only splits tokens on specific delimiters (if I'm not mistaken, the default word_delimiter_graph does not accept a custom list of delimiters).

For example, I only want to split tokens on the - delimiter:

i-pod -> [i-pod, i, pod]
i_pod -> [i_pod]

(because I only want to split on -, not on any other character)

How can I achieve this?

Thanks!
I solved it with the type_table parameter:

(Optional, array of strings) Array of custom type mappings for characters. This allows you to map non-alphanumeric characters as numeric or alphanumeric to avoid splitting on those characters.
For example, the following array maps the plus (+) and hyphen (-) characters as alphanumeric, which means they won't be treated as delimiters.
Test with i-pad:

GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "word_delimiter_graph",
      "preserve_original": true,
      "type_table": [ "_ => ALPHA" ]
    }
  ],
  "text": "i-pad"
}
Tokens:

{
  "tokens": [
    {
      "token": "i-pad",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0,
      "positionLength": 2
    },
    {
      "token": "i",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "pad",
      "start_offset": 2,
      "end_offset": 5,
      "type": "word",
      "position": 1
    }
  ]
}
Test with i_pad:

GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "word_delimiter_graph",
      "preserve_original": true,
      "type_table": [ "_ => ALPHA" ]
    }
  ],
  "text": "i_pad"
}
Tokens:

{
  "tokens": [
    {
      "token": "i_pad",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    }
  ]
}
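To use this beyond the ad-hoc _analyze API, the filter can be registered as a custom analyzer in the index settings. A minimal sketch below; the index name (my-index), filter name (split_on_hyphen), analyzer name (hyphen_analyzer), and field name (name) are all hypothetical:

```json
PUT /my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "split_on_hyphen": {
          "type": "word_delimiter_graph",
          "preserve_original": true,
          "type_table": [ "_ => ALPHA" ]
        }
      },
      "analyzer": {
        "hyphen_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "split_on_hyphen" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": { "type": "text", "analyzer": "hyphen_analyzer" }
    }
  }
}
```

Note that type_table only overrides the characters you explicitly list: any other non-alphanumeric character (e.g. +) will still act as a delimiter. To split exclusively on -, add a "X => ALPHA" mapping for each additional character you want to preserve.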