In Solr (4.10.3), I have a query (not using dismax or edismax):
t:"past surgical cardiovascular system"
Debug output for the query:
"rawquerystring": "t:\"past surgical cardiovascular system\"",
"querystring": "t:\"past surgical cardiovascular system\"",
"parsedquery": "MultiPhraseQuery(t:\"(ex former formerly previous prior past) (surgery surg surgical operative)\")",
"parsedquery_toString": "t:\"(ex former formerly previous prior past) (surgery surg surgical operative)\"",
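Solr seems to completely ignore the tokens from the third position onward. I am a bit frustrated, because I only noticed this after 8 hours of investigation. What am I missing? How can I force Solr to take the third and fourth tokens into account?
In case it helps, the t field has this type: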
<fieldType name="text_en_splitting" class="solr.TextField"
positionIncrementGap="100" autoGeneratePhraseQueries="false">
<analyzer type="index">
<!-- <tokenizer class="solr.WhitespaceTokenizerFactory" /> -->
<tokenizer class="solr.PatternTokenizerFactory" pattern="\s*[{}\[\]|():;,]\s*|\b[-/+]\b|\s+[&amp;+-]\s+|(?:\b')?\s+|\.(?=\z|\s)" />
<!-- in this example, we will only use synonyms at query time <filter
class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true"
expand="false"/> -->
<!-- Case insensitive stop word removal. add enablePositionIncrements=true
in both the index and query analyzers to leave a 'gap' for more accurate
phrase queries. -->
<filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="10"/>
<filter class="solr.ClassicFilterFactory" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<!-- <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" /> -->
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.EnglishPossessiveFilterFactory" />
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.EnglishMinimalStemFilterFactory" />
</analyzer>
<analyzer type="query">
<!-- <tokenizer class="solr.WhitespaceTokenizerFactory" /> -->
<tokenizer class="solr.PatternTokenizerFactory" pattern="\s*[{}\[\]|():;,]\s*|\b[-/+]\b|\s+[&amp;+-]\s+|(?:\b')?\s+|\.(?=\z|\s)" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
<filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="10"/>
<filter class="solr.ClassicFilterFactory" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<!-- <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" /> -->
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.EnglishPossessiveFilterFactory" />
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.EnglishMinimalStemFilterFactory" />
<!-- <filter class="solr.PorterStemFilterFactory" /> -->
</analyzer>
</fieldType>
I think there is a bug somewhere in Solr.
I ran a different query and got all of the tokens in the parsed query:
"rawquerystring": "t:\"acute myocardial infarction surgical\"",
"querystring": "t:\"acute myocardial infarction surgical\"",
"parsedquery": "MultiPhraseQuery(t:\"(acute aqt) (myocardial myocrd) (infarct infarction nfrct) (surgery surg surgical)\")",
"parsedquery_toString": "t:\"(acute aqt) (myocardial myocrd) (infarct infarction nfrct) (surgery surg surgical)\"",
If I prefix the query with "past", tokens get dropped:
"rawquerystring": "t:\"past acute myocardial infarction surgical\"",
"querystring": "t:\"past acute myocardial infarction surgical\"",
"parsedquery": "MultiPhraseQuery(t:\"(ex former formerly previous prior past) (acute aqt) (myocardial myocrd)\")",
"parsedquery_toString": "t:\"(ex former formerly previous prior past) (acute aqt) (myocardial myocrd)\"",
The Analysis page did not give me much detail, because it analyzes the tokens independently.
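I finally found the problem: I was applying solr.LimitTokenCountFilterFactory after synonym expansion, which capped the query at 10 tokens. The cap counts the expanded synonyms, not the words I typed: "past" expands to 6 tokens and "surgical" to 4, so the limit of 10 is already reached before "cardiovascular" and "system" are seen. The solution is to remove this filter.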
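For illustration, the query analyzer from the question might look like the sketch below once the limit is dropped. This is not a tested configuration: the filter list is copied from the question, the commented-out alternatives are omitted for brevity, and the backslashes in the tokenizer pattern are an assumption on my part, so keep whatever pattern your own schema.xml already uses.

```xml
<!-- Sketch: query analyzer from the question, with the token limit removed. -->
<analyzer type="query">
  <!-- Assumption: the backslashes in this pattern are reconstructed here;
       keep the pattern your own schema.xml already uses. -->
  <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*[{}\[\]|():;,]\s*|\b[-/+]\b|\s+[&amp;+-]\s+|(?:\b')?\s+|\.(?=\z|\s)" />
  <!-- Synonym expansion can turn one typed word into several tokens. -->
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
  <!-- Removed: LimitTokenCountFilterFactory with maxTokenCount="10", which
       cut the expanded token stream after the tenth token. -->
  <filter class="solr.ClassicFilterFactory" />
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
  <filter class="solr.LowerCaseFilterFactory" />
  <filter class="solr.EnglishPossessiveFilterFactory" />
  <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
  <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
  <filter class="solr.EnglishMinimalStemFilterFactory" />
</analyzer>
```

The index analyzer in the question carries the same filter with maxTokenCount="10", so it is worth checking whether truncating indexed values to ten tokens is intended. If a cap on query length is still wanted, one option (untested here) would be to place the limit before the synonym filter, so that it counts the typed words and not their expansions.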
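You have an extremely complex query analyzer chain. Fortunately, you can use the Analysis screen in the web admin UI to see exactly what happens inside it.
Put your phrase there (on the right-hand side, which is for query processing) and step through what happens to each word.
This should tell you, for example, whether some terms are unexpectedly swallowed by one of the layers.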