Lucene.Net v4.8.0-beta0007 - Custom StopWord Analyzer - Exception: Cannot read from a closed TextReader



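We are trying to migrate from v3.0.3 to v4.8.0-beta0007 on .NET Framework 4.5.

We previously had a custom StopWords analyzer that inherited from Analyzer. After the upgrade, there is an abstract method that must be implemented: TokenStreamComponents CreateComponents(string fieldName, TextReader reader).

When we follow the documentation at https://lucenenet.apache.org/download/version-4.html to implement this method, we get an exception: "Cannot read from a closed TextReader."

Here is our implementation: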

protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
{
    Analyzer analyzer = new StandardAnalyzer(_luceneVersion, reader);
    TokenStream ts = analyzer.GetTokenStream(fieldName, reader);
    var tokenizer = new StandardTokenizer(_luceneVersion, reader);
    try
    {
        ts.Reset(); // Resets this stream to the beginning. (Required)
        while (ts.IncrementToken())
        {
        }
        ts.End();   // Perform end-of-stream operations, e.g. set the final offset.
    }
    catch (Exception ex)
    {
        _ = ex.Message;
        throw;
    }
    finally
    {
        ts.Dispose();
    }
    return new TokenStreamComponents(tokenizer, ts);
}

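Answer, from https://github.com/apache/lucenenet/issues/246#issuecomment-620808822:

Since CreateComponents() is a factory method (meaning it is a creational operation), you should only dispose of short-lived dependencies there. Because you dispose of the stream before returning it, it is not in a state that the caller of CreateComponents() can use.

To make a custom standard analyzer, the best approach is to model the new class after the existing StandardAnalyzer class.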

using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

public sealed class MyStopwordAnalyzer : StopwordAnalyzerBase
{
    /// <summary>
    /// An unmodifiable set containing some common English words that are usually not
    /// useful for searching.
    /// </summary>
    public static readonly CharArraySet STOP_WORDS_SET = LoadEnglishStopWordsSet();

    private static CharArraySet LoadEnglishStopWordsSet() // LUCENENET: Avoid static constructors (see https://github.com/apache/lucenenet/pull/224#issuecomment-469284006)
    {
        IList<string> stopWords = new string[] { "a", "an", "and", "are", "as", "at", "be",
            "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on",
            "or", "such", "that", "the", "their", "then", "there", "these", "they", "this",
            "to", "was", "will", "with" };
#pragma warning disable 612, 618
        var stopSet = new CharArraySet(LuceneVersion.LUCENE_CURRENT, stopWords, false);
#pragma warning restore 612, 618
        return CharArraySet.UnmodifiableSet(stopSet);
    }

    /// <summary>
    /// Builds an analyzer with the given stop words. </summary>
    /// <param name="matchVersion"> Lucene compatibility version - See <see cref="MyStopwordAnalyzer"/> </param>
    /// <param name="stopWords"> stop words </param>
    public MyStopwordAnalyzer(LuceneVersion matchVersion, CharArraySet stopWords)
        : base(matchVersion, stopWords)
    {
    }

    /// <summary>
    /// Builds an analyzer with the default stop words (<see cref="STOP_WORDS_SET"/>). </summary>
    /// <param name="matchVersion"> Lucene compatibility version - See <see cref="MyStopwordAnalyzer"/> </param>
    public MyStopwordAnalyzer(LuceneVersion matchVersion)
        : this(matchVersion, STOP_WORDS_SET)
    {
    }

    /// <summary>
    /// Builds an analyzer with the stop words from the given reader. </summary>
    /// <seealso cref="WordlistLoader.GetWordSet(TextReader, LuceneVersion)"/>
    /// <param name="matchVersion"> Lucene compatibility version - See <see cref="MyStopwordAnalyzer"/> </param>
    /// <param name="stopwords"> <see cref="TextReader"/> to read stop words from </param>
    public MyStopwordAnalyzer(LuceneVersion matchVersion, TextReader stopwords)
        : this(matchVersion, LoadStopwordSet(stopwords, matchVersion))
    {
    }

    // Only builds and returns the analysis chain; the caller consumes and disposes the stream.
    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        var src = new StandardTokenizer(m_matchVersion, reader);
        TokenStream tok = new StandardFilter(m_matchVersion, src);
        // tok = new LowerCaseFilter(m_matchVersion, tok); // optional
        tok = new StopFilter(m_matchVersion, tok, m_stopwords);
        return new TokenStreamComponents(src, tok);
    }
}
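The Reset/IncrementToken/End/Dispose loop from the question belongs in the code that consumes the analyzer, not inside CreateComponents(). The following is a minimal sketch of such a caller, assuming the MyStopwordAnalyzer class above is in scope. The field name "body", the sample text, the AnalyzerDemo class name, and the choice of LuceneVersion.LUCENE_48 are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

public static class AnalyzerDemo
{
    public static void Main()
    {
        // Assumption: LUCENE_48 is the compatibility version you want.
        using (Analyzer analyzer = new MyStopwordAnalyzer(LuceneVersion.LUCENE_48))
        // GetTokenStream() calls CreateComponents() internally and hands back a usable stream.
        using (TokenStream ts = analyzer.GetTokenStream("body", new StringReader("the quick fox and the dog")))
        {
            // The attribute instance is updated in place on every IncrementToken() call.
            ICharTermAttribute term = ts.AddAttribute<ICharTermAttribute>();

            ts.Reset();                 // required before the first IncrementToken()
            while (ts.IncrementToken())
            {
                Console.WriteLine(term.ToString());
            }
            ts.End();                   // end-of-stream operations, e.g. final offset
        }                               // the stream is disposed here, by the consumer
    }
}
```

With the default stop set, words such as "the" and "and" should be filtered out of the printed terms. Because the LowerCaseFilter line in CreateComponents() is commented out, capitalized input such as "The" may not match the lowercase stop words.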

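Note that the existing StandardAnalyzer class also lets you pass in a CharArraySet of stop words, which may be enough for your needs if you are happy to have LowerCaseFilter normalize the text.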
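If the built-in analyzer is sufficient, a sketch along these lines may be all that is needed. The StandardAnalyzerFactory class name, the three sample stop words, and the choice of LuceneVersion.LUCENE_48 are assumptions made for illustration.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

public static class StandardAnalyzerFactory
{
    public static Analyzer Create()
    {
        // Assumption: LUCENE_48 is the compatibility version you want.
        var version = LuceneVersion.LUCENE_48;

        // Hypothetical stop word list; the third argument enables case-insensitive matching.
        var stopWords = new CharArraySet(version, new[] { "a", "an", "the" }, true);

        // StandardAnalyzer builds its own tokenizer and filter chain, including lowercasing.
        return new StandardAnalyzer(version, stopWords);
    }
}
```

This avoids maintaining a custom Analyzer subclass at the cost of accepting the filter chain that StandardAnalyzer defines.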
