Azure Cognitive Search - How to prevent the EdgeNGram tokenizer from breaking words at hyphens?



Below is how I create an Azure Search index for Cosmos DB documents from my search request model (some fields are omitted from the model for brevity).

Please suggest what needs to change in the implementation below so that the edgeNgramTokenFilterV2 token filter does not split words at hyphens.

using System.Collections.Generic;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;

public class SearchRequest
{
    [SimpleField(IsKey = true, IsFilterable = true)]
    public string id { get; set; }

    // Indexed with the custom edge n-gram analyzer; queries use the standard Lucene analyzer.
    [SearchableField(SearchAnalyzerName = LexicalAnalyzerName.Values.StandardLucene, IndexAnalyzerName = "prefixEdgeAnalyzer")]
    public string EntityID { get; set; }

    public MetaData? MetaData { get; set; }
}

public class MetaData
{
    [SearchableField(AnalyzerName = LexicalAnalyzerName.Values.EnMicrosoft)]
    public string? CustomerName { get; set; }

    // Same index-time / search-time analyzer split as EntityID.
    [SearchableField(SearchAnalyzerName = LexicalAnalyzerName.Values.StandardLucene, IndexAnalyzerName = "prefixEdgeAnalyzer")]
    public List<string>? OpportunityIDs { get; set; }
}


public async Task<Response<SearchIndex>> CreateIndex(string indexName)
{
    try
    {
        // Edge n-gram filter: emits prefixes of 3 to 20 characters from the front of each token.
        var nedgeTokenfilter = new EdgeNGramTokenFilter("edgeNgramTokenFilterV2");
        nedgeTokenfilter.MinGram = 3;
        nedgeTokenfilter.MaxGram = 20;
        nedgeTokenfilter.Side = EdgeNGramTokenFilterSide.Front;

        // Custom index-time analyzer: standard tokenizer, then lowercase, then edge n-grams.
        var prefixEdgeAnalyzer = new CustomAnalyzer("prefixEdgeAnalyzer", LexicalTokenizerName.Standard);
        prefixEdgeAnalyzer.TokenFilters.Add(TokenFilterName.Lowercase);
        prefixEdgeAnalyzer.TokenFilters.Add("edgeNgramTokenFilterV2");

        var suggester = new SearchSuggester("spellCheckSuggester", $"MetaData/{nameof(SearchRequest.MetaData.CustomerName)}"); // for spell check

        // Build the field list from the model attributes and assemble the index definition.
        FieldBuilder fieldBuilder = new FieldBuilder();
        var searchFields = fieldBuilder.Build(typeof(SearchRequest));
        var definition = new SearchIndex(indexName, searchFields);
        definition.TokenFilters.Add(nedgeTokenfilter);
        definition.Analyzers.Add(prefixEdgeAnalyzer);
        definition.Suggesters.Add(suggester);

        var response = await _adminClient.CreateOrUpdateIndexAsync(definition).ConfigureAwait(false);

        return response;
    }
    catch (Exception ex)
    {
        _logger.LogError(ex, ex.Message);
        throw;
    }
}

Using the Analyze API, I can see that the text "7-ETREW" is tokenized as etr, etre, etrew, whereas I need it to be tokenized as 7-e, 7-et, 7-etr, 7-etre, 7-etrew.

https://{myServicename}.search.windows.net/indexes/{MyIndexname}/analyze?api-version=2020-06-30
{
"text": "7-etrew",
"analyzer": "prefixEdgeAnalyzer"
}
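
For reference, the same check could be sketched with the .NET SDK instead of a raw REST call. This assumes the Azure.Search.Documents package and an already configured SearchIndexClient; the class and method names (AnalyzerProbe, GetTokensAsync) are hypothetical and only serve the illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Azure;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;

// Hypothetical helper, not part of the original code.
public static class AnalyzerProbe
{
    // Runs the named analyzer of the given index over `text`
    // and returns the tokens it emits, in order.
    public static async Task<List<string>> GetTokensAsync(
        SearchIndexClient adminClient, string indexName, string analyzerName, string text)
    {
        // Custom analyzers are referenced by the name they were registered under.
        var options = new AnalyzeTextOptions(text, new LexicalAnalyzerName(analyzerName));

        Response<IReadOnlyList<AnalyzedTokenInfo>> response =
            await adminClient.AnalyzeTextAsync(indexName, options).ConfigureAwait(false);

        return response.Value.Select(t => t.Token).ToList();
    }
}
```

A call such as `await AnalyzerProbe.GetTokensAsync(_adminClient, indexName, "prefixEdgeAnalyzer", "7-etrew")` would return the same tokens as the REST request above.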

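This is probably because you are using the LexicalTokenizerName.Standard tokenizer, which splits tokens on various delimiters such as '-'. If you only want to split on whitespace, you can use the whitespace tokenizer; if you do not want to split the text at all, you can try the Keyword tokenizer.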
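
A minimal sketch of that change, assuming the rest of CreateIndex stays as in the question: only the tokenizer passed to CustomAnalyzer differs. The helper class and method names are hypothetical.

```csharp
using Azure.Search.Documents.Indexes.Models;

// Hypothetical helper that isolates the analyzer setup from CreateIndex.
public static class PrefixAnalyzerSetup
{
    public static void AddPrefixEdgeAnalyzer(SearchIndex definition)
    {
        var edgeFilter = new EdgeNGramTokenFilter("edgeNgramTokenFilterV2")
        {
            MinGram = 3,
            MaxGram = 20,
            Side = EdgeNGramTokenFilterSide.Front
        };

        // Whitespace tokenizer: splits on whitespace only, so "7-ETREW" stays one token.
        // LexicalTokenizerName.Keyword would instead keep the whole field value as a single token.
        var prefixEdgeAnalyzer = new CustomAnalyzer("prefixEdgeAnalyzer", LexicalTokenizerName.Whitespace);
        prefixEdgeAnalyzer.TokenFilters.Add(TokenFilterName.Lowercase);
        prefixEdgeAnalyzer.TokenFilters.Add("edgeNgramTokenFilterV2");

        definition.TokenFilters.Add(edgeFilter);
        definition.Analyzers.Add(prefixEdgeAnalyzer);
    }
}
```

With MinGram = 3, the text "7-etrew" should then be indexed as 7-e, 7-et, 7-etr, 7-etre, 7-etrew. Two caveats, both worth verifying against your own service: the fields in the question still use StandardLucene as the search analyzer, which will likely split a query such as 7-etr at the hyphen, so the query side may need a matching custom analyzer (same tokenizer plus lowercase, without the edge n-gram filter). Changing the analyzer behind an existing field also typically requires rebuilding the index.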
