How to define a custom synonym token filter with Elasticsearch DSL in Python



I am trying to build a synonym token filter with Elasticsearch DSL in Python so that, for example, a search for "tiny" or "little" also returns articles containing "small". Here is my code:

from elasticsearch import Elasticsearch
from elasticsearch_dsl import analyzer, connections, token_filter

# Connect to local host server
connections.create_connection(hosts=['127.0.0.1'])

spelling_tokenfilter = token_filter(
    'my_tokenfilter',  # Name for the filter
    'synonym',  # Synonym filter type
    synonyms_path="analysis/wn_s.pl"
)

# Create elasticsearch object
es = Elasticsearch()

text_analyzer = analyzer('my_tokenfilter',
                         type='custom',
                         tokenizer='standard',
                         filter=['lowercase', 'stop', spelling_tokenfilter])

I created a folder named "analysis" in es-7.6.2/config, downloaded the WordNet prolog database, and copied "wn_s.pl" into it. But when I run the program I get this error:

Traceback (most recent call last):
  File "index.py", line 161, in <module>
    main()
  File "index.py", line 156, in main
    buildIndex()
  File "index.py", line 74, in buildIndex
    covid_index.create()
  File "C:\Anaconda\lib\site-packages\elasticsearch_dsl\index.py", line 259, in create
    return self._get_connection(using).indices.create(index=self._name, body=self.to_dict(), **kwargs)
  File "C:\Anaconda\lib\site-packages\elasticsearch\client\utils.py", line 92, in _wrapped
    return func(*args, params=params, headers=headers, **kwargs)
  File "C:\Anaconda\lib\site-packages\elasticsearch\client\indices.py", line 104, in create
    "PUT", _make_path(index), params=params, headers=headers, body=body
  File "C:\Anaconda\lib\site-packages\elasticsearch\transport.py", line 362, in perform_request
    timeout=timeout,
  File "C:\Anaconda\lib\site-packages\elasticsearch\connection\http_urllib3.py", line 248, in perform_request
    self._raise_error(response.status, raw_data)
  File "C:\Anaconda\lib\site-packages\elasticsearch\connection\base.py", line 244, in _raise_error
    status_code, error_message, additional_info
elasticsearch.exceptions.RequestError: RequestError(400, 'illegal_argument_exception', 'failed to build synonyms')

Does anyone know how to fix this? Thanks!

This happens because the lowercase and stop token filters are defined before the synonym filter (docs):

Elasticsearch will use the token filters preceding the synonym filter in a tokenizer chain to parse the entries in a synonym file. So, for example, if a synonym filter is placed after a stemmer, then the stemmer will also be applied to the synonym entries.

First, let's get more details about the error by catching the exception:

>>> text_analyzer = analyzer('my_tokenfilter',
...                          type='custom',
...                          tokenizer='standard',
...                          filter=[
...                              'lowercase', 'stop',
...                              spelling_tokenfilter
...                              ])
>>>
>>> try:
...   text_analyzer.simulate('blah blah')
... except Exception as e:
...   ex = e
...
>>> ex
RequestError(400, 'illegal_argument_exception', {'error': {'root_cause': [{'type': 'illegal_argument_exception', 'reason': 'failed to build synonyms'}], 'type': 'illegal_argument_exception', 'reason': 'failed to build synonyms', 'caused_by': {'type': 'parse_exception', 'reason': 'Invalid synonym rule at line 109', 'caused_by': {'type': 'illegal_argument_exception', 'reason': 'term: course of action analyzed to a token (action) with position increment != 1 (got: 2)'}}}, 'status': 400})

This part in particular is interesting:

"原因":"第109行的同义词规则无效","caused_by":{"type":"illegal_argument_exception","reason":"term:对位置增量为1(get:2(的令牌(操作(分析的操作过程"}}

This indicates that it managed to find the file, but failed to parse it.
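To see where that position gap comes from, you can run just those two filters over the offending entry. This is only a quick probe (the analyzer name and the sample text are made up for illustration), but it shows the stop filter dropping "of" and leaving a hole in the token positions, which is exactly what the synonym parser rejects:

from elasticsearch_dsl import analyzer

# Throwaway analyzer with only the two filters that precede the synonym filter.
probe_analyzer = analyzer('probe_analyzer',
                          type='custom',
                          tokenizer='standard',
                          filter=['lowercase', 'stop'])

# 'of' is a stop word, so only 'course' and 'action' come back, with 'action'
# two positions after 'course' -- the "position increment != 1 (got: 2)" above.
probe_analyzer.simulate('course of action')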

Finally, the error goes away if you remove those two token filters:

text_analyzer = analyzer('my_tokenfilter',
                         type='custom',
                         tokenizer='standard',
                         filter=[
                             # 'lowercase', 'stop',
                             spelling_tokenfilter
                         ])
...
>>> text_analyzer.simulate("blah")
{'tokens': [{'token': 'blah', 'start_offset': 0, 'end_offset...}
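Once the analyzer simulates cleanly, it still has to be attached to the index so that the covid_index.create() call from your traceback sends the filter definitions along. A minimal sketch, assuming the index is simply named 'covid':

from elasticsearch_dsl import Index

covid_index = Index('covid')
# Registering the analyzer puts the custom token filter definitions into the
# index settings that create() sends to Elasticsearch.
covid_index.analyzer(text_analyzer)
covid_index.create()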

The docs suggest using the multiplexer token filter in case you need to combine them.
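If you do need lowercase/stop together with the synonyms, one way to follow that suggestion is to put them in separate multiplexer branches, so the synonym file is no longer parsed through the stop filter. The sketch below goes through the raw index settings with the low-level client; the index and filter names are made up, and the exact layout should be checked against the multiplexer docs for your Elasticsearch version:

from elasticsearch import Elasticsearch

es = Elasticsearch()

settings = {
    "settings": {
        "analysis": {
            "filter": {
                "my_synonyms": {
                    "type": "synonym",
                    "synonyms_path": "analysis/wn_s.pl",
                },
                "my_multiplexer": {
                    "type": "multiplexer",
                    # one branch applies lowercase + stop, the other only the
                    # synonyms, so the stop filter never touches the synonym file
                    "filters": ["lowercase, stop", "my_synonyms"],
                },
            },
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["my_multiplexer"],
                }
            },
        }
    }
}

es.indices.create(index="covid-multiplexer-demo", body=settings)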

Hope this helps!
