Lucene.Net search fails if the term is too short



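I'm new to Lucene, so perhaps this is a technical limitation I don't understand.

I indexed some text and am trying to retrieve it. If I search the text "open-source reciprocal productivity" with the query "source", I get a match. If I query "sour", I also get a match. But if I use the query "sou", I get no match.

I'm using Lucene.Net version 4.8. Below is the code I use to create the index: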

using (var dir = FSDirectory.Open(targetDirectory))
{
    Analyzer analyzer = metadata.GetAnalyzer(); // returns new StandardAnalyzer(LuceneVersion.LUCENE_48)

    var indexConfig = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer);
    using (IndexWriter writer = new IndexWriter(dir, indexConfig))
    {
        long entryNumber = csvRecords.Count();
        long index = 0;
        long lastPercentage = 0;
        foreach (dynamic csvEntry in csvRecords)
        {
            Document doc = new Document();
            IDictionary<string, object> dynamicCsvEntry = (IDictionary<string, object>)csvEntry;
            var indexedMetadataFiled = metadata.IdexedFields;
            foreach (string headField in header)
            {
                if (indexedMetadataFiled.ContainsKey(headField) == false || (indexedMetadataFiled[headField].NeedToBeIndexed == false && indexedMetadataFiled[headField].NeedToBeStored == false))
                    continue;
                var field = new Field(headField,
                    ((string)dynamicCsvEntry[headField] ?? string.Empty).ToLower(),
                    indexedMetadataFiled[headField].NeedToBeStored ? Field.Store.YES : Field.Store.NO, //YES
                    indexedMetadataFiled[headField].NeedToBeIndexed ? Field.Index.ANALYZED : Field.Index.NO //YES
                );
                doc.Add(field);
            }
            long percentage = (long)(((decimal)index / (decimal)entryNumber) * 100m);
            if (percentage > lastPercentage && percentage % 10 == 0)
            {
                _consoleLogger.Information($"..indexing {percentage}%..");
                lastPercentage = percentage;
            }

            writer.AddDocument(doc);
            index++;
        }
        writer.Commit();
    }
}

Below is the code I use to query the index:

// Split the query on runs of non-word characters.
var tokens = Regex.Split(query.Trim(), @"\W+");
BooleanQuery composedQuery = new BooleanQuery();
foreach (var field in luceneHint.FieldsToSearch)
{
    foreach (string word in tokens)
    {
        if (string.IsNullOrWhiteSpace(word))
            continue;
        var termQuery = new FuzzyQuery(new Term(field.FieldName, word.ToLower()));
        termQuery.Boost = (float)field.Weight;
        composedQuery.Add(termQuery, Occur.SHOULD);
    }
}
var indexManager = IndexManager.Instance;
ReferenceManager<IndexSearcher> index = indexManager.Read(boundle);
int resultLimit = luceneHint?.Top ?? RESULT_LIMIT;
var results = new List<JObject>();
var searcher = index.Acquire();
try
{
    Dictionary<string, FieldDescriptor> filedToRead = (luceneHint?.FieldsToRead?.Any() ?? false) ?
        luceneHint.FieldsToRead.ToDictionary(item => item.FieldName, item => item) :
        new Dictionary<string, FieldDescriptor>();
    bool fetchEveryField = filedToRead.Count == 0;
    TopScoreDocCollector collector = TopScoreDocCollector.Create(resultLimit, true);
    int startPageIndex = pageIndex * itemsPerPage;
    searcher.Search(composedQuery, collector);
    //TopDocs topDocs = searcher.Search(composedQuery, luceneHint?.Top ?? 100);
    TopDocs topDocs = collector.GetTopDocs(startPageIndex, itemsPerPage);
    foreach (var scoreDoc in topDocs.ScoreDocs)
    {
        Document doc = searcher.Doc(scoreDoc.Doc);
        dynamic result = new JObject();
        foreach (var field in doc.Fields)
            if (fetchEveryField || filedToRead.ContainsKey(field.Name))
                result[field.Name] = field.GetStringValue();
        results.Add(result);
    }
}
finally
{
    if (searcher != null)
        index.Release(searcher);
}
return results;

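I'm confused. Is the fact that I can't get results for the "sou" query related to the StandardAnalyzer used to build the index, with some stop-word handling preventing my query term from being found in the index? ("source" and "sour" are found, perhaps because both are English words.)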

P.S.: here is the Explain output, even though I don't know how to use it:

searcher.Explain(composedQuery, 6)  {0 = (NON-MATCH) sum of:}  Description: "sum of:"  IsMatch: false  Match: false  Value: 0

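The documentation for FuzzyQuery states that it uses a default minimumSimilarity of 0.5: https://lucenenet.apache.org/docs/3.0.3/d0/db9/class_lucene_1_1_net_1_1_search_1_1_fuzzy_query.html

minimumSimilarity: a value between 0 and 1 that sets the required similarity between the query term and the matching terms. For example, for a minimumSimilarity of 0.5, a term of the same length as the query term is considered similar to the query term if the edit distance between both terms is less than length(term) * 0.5.

So it matches "source" when the query is "sour", because getting from one to the other means adding or removing the "ce". That takes two edits, giving an edit distance of 2, which is <= length("sour") * 0.5. However, matching "source" against "sou" requires 3 edits, so it is not a match.

You should see the same document match even if you search for something like "bounce" or "sauce", since they are also within two edits of "source".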
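To check the edit counts above by hand, here is a minimal, self-contained sketch that computes the plain Levenshtein distance between each query and the indexed term "source". It is not Lucene.Net's own implementation: `EditDistanceDemo`, `Levenshtein`, and `maxEdits` are illustrative names, and the threshold of 2 is an assumption taken from the reasoning above.

```csharp
using System;

public static class EditDistanceDemo
{
    // Classic dynamic-programming Levenshtein distance:
    // insertions, deletions and substitutions each cost 1.
    public static int Levenshtein(string a, string b)
    {
        var previous = new int[b.Length + 1];
        var current = new int[b.Length + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.Length; j++)
            previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int substitution = previous[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                current[j] = Math.Min(substitution, Math.Min(deletion, insertion));
            }
            // Reuse the two row buffers for the next iteration.
            (previous, current) = (current, previous);
        }
        return previous[b.Length];
    }

    public static void Main()
    {
        const int maxEdits = 2; // assumed threshold, per the explanation above
        foreach (var query in new[] { "source", "sour", "sou", "bounce", "sauce" })
        {
            int distance = Levenshtein(query, "source");
            Console.WriteLine($"{query}: distance={distance}, withinThreshold={distance <= maxEdits}");
        }
    }
}
```

Under that assumption, "sour", "bounce" and "sauce" each come out at distance 2 from "source" and fall within the threshold, while "sou" comes out at distance 3 and does not, which is consistent with the behaviour described in the question.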
