SOLR eDismax typo tolerance

How do I build a query that searches for an exact phrase as well as the same phrase with a few typos? I am stuck on this, and it looks like I am going in the wrong direction.

For example, I have the following in my edismax query:

q=apple iphone

It works, but now I need to make it more tolerant of typos. I updated my query, and now it returns the same results as before even when the user makes a typo:

q=aple~2 iphane~2
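
The ~2 here is the Lucene fuzzy operator: it matches indexed terms within edit distance 2 of the given term (Lucene actually uses Damerau-Levenshtein, which also counts transpositions). A minimal sketch of plain Levenshtein distance, only to illustrate why these typos fall inside the threshold; this is not Solr code:

# Plain Levenshtein distance, for illustration only.
# "aple" -> "apple" is one insertion, "iphane" -> "iphone" is one
# substitution, so both are well within the ~2 threshold.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("aple", "apple"))     # 1
print(levenshtein("iphane", "iphone"))  # 1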

Next, I noticed that an exact query match does not always end up on the first page (for example, I really do have a product called "aple iphane"). So I added the exact query with an "OR" condition. Now my query looks like this:

q=(aple~2 iphane~2) OR 'aple iphane'^3

The problem is that it now returns only exact matches and no fuzzy entries. What am I doing wrong?

Here is the full query:

http://localhost:8983/solr/test/select?omitHeader=true
&q=(aple~2 iphane~2) OR 'aple iphane'^3
&start=0
&rows=30
&fl=*,score
&fq=itemType:"Product"
&defType=edismax
&qf=title_de^1000 title_de_ranked^1000 description_de^1 category_name_de^50 brand^15 merchant_name^80 uniuque_values^10000 searchable_attribute_product.name^1000 searchable_attribute_product.description.short^100 searchable_attribute_product.description.long^100 searchable_attribute_mb.book.author^500
&mm=90
&pf=title_de^2000 description_de^2
&ps=1
&qs=2
&boost=category_boost
&mm.autoRelax=true
&wt=json
&json.nl=flat
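
For completeness, a minimal sketch of sending the same request from Python with the requests library (nothing Solr-specific beyond the parameters above; the qf value is abbreviated here, use the full list from the query):

import requests

# Sketch: issue the same edismax request over plain HTTP.
params = {
    "omitHeader": "true",
    "q": "(aple~2 iphane~2) OR 'aple iphane'^3",
    "start": 0,
    "rows": 30,
    "fl": "*,score",
    "fq": 'itemType:"Product"',
    "defType": "edismax",
    "qf": "title_de^1000 description_de^1",  # abbreviated, see the full list above
    "mm": "90",
    "pf": "title_de^2000 description_de^2",
    "ps": 1,
    "qs": 2,
    "boost": "category_boost",
    "mm.autoRelax": "true",
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/test/select", params=params)
print(resp.json()["response"]["numFound"])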

Is there an error in my query, or is my whole approach wrong?

I want to find the phrase primarily in "title_de"; all the other fields are secondary. Here is the field type from my schema:

<fieldType name="text_de_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.GermanLightStemFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.GermanLightStemFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German" />
  </analyzer>
</fieldType>
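
One thing to note about this type: the index analyzer ends with NGramFilterFactory, so every stemmed token is indexed as all of its 2-25 character substrings. This matters for the debug output later, where the fuzzy queries match partial terms such as amsung, msung and samsun. A rough sketch of just the ngram step (the real analysis chain also lowercases, stems, etc.):

# Rough approximation of solr.NGramFilterFactory with minGramSize=2, maxGramSize=25.
def ngrams(token, min_size=2, max_size=25):
    for size in range(min_size, min(max_size, len(token)) + 1):
        for start in range(len(token) - size + 1):
            yield token[start:start + size]

print(sorted(set(ngrams("samsung"))))
# includes 'amsung', 'msung', 'samsun' and 'samsung' itself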

Thanks!


UPD: I found that my query (q=(aple~2 iphane~2) OR 'aple iphane'^3) was incorrect, so I worked out how to build two other queries that perform better; you can see them at the end of this post. I still don't understand why they give different results, since the default operator for Solr queries is "OR", so "term1 OR term2 OR term3 OR term4" should be the same as "(term1 OR term2) OR (term3 OR term4)".

As @Persimmonium suggested, I have added some debug examples to show that edismax's fuzzy queries work (though not always as expected). "apple iphone" turned out not to be the best example in my large German index, so I used a product called "Samsung Magic Info-Lite" instead.

Here are all the parameters of my query:

"params":{
"mm":"100%",
"q":"samsung magic",
"defType":"edismax",
"indent":"on",
"qf":"title_de",
"fl":"*,score",
"pf":"title_de",
"wt":"json",
"debugQuery":"on",
"_":"1501409530601"
}

This query returns the correct products (I have 6 products with both words in the title_de field). Then I introduce a typo into each of the two words:

"q":"somsung majic"

No products are found.

Then I add the fuzzy operator to both words:

"q":"somsung~2 majic~2"

6 products are found. Here is the debug output:

"debug":{
"rawquerystring":"somsung~2 majic~2",
"querystring":"somsung~2 majic~2",
"parsedquery":"(+(DisjunctionMaxQuery((title_de:somsung~2)) DisjunctionMaxQuery((title_de:majic~2)))~2 DisjunctionMaxQuery((title_de:"somsung 2 majic 2")))/no_coord",
"parsedquery_toString":"+(((title_de:somsung~2) (title_de:majic~2))~2) (title_de:"somsung 2 majic 2")",
"explain":{
"69019":"n1.3424492 = sum of:n  1.3424492 = sum of:n    1.1036766 = sum of:n      0.26367697 = weight(title_de:amsung in 305456) [ClassicSimilarity], result of:n        0.26367697 = score(doc=305456,freq=1.0), product of:n          0.073149204 = queryWeight, product of:n            0.6666666 = boostn            7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:n              635.0 = docFreqn              316313.0 = docCountn            0.015219777 = queryNormn          3.604646 = fieldWeight in 305456, product of:n            1.0 = tf(freq=1.0), with freq of:n              1.0 = termFreq=1.0n            7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:n              635.0 = docFreqn              316313.0 = docCountn            0.5 = fieldNorm(doc=305456)n      0.2373093 = weight(title_de:msung in 305456) [ClassicSimilarity], result of:n        0.2373093 = score(doc=305456,freq=1.0), product of:n          0.06583429 = queryWeight, product of:n            0.6 = boostn            7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:n              635.0 = docFreqn              316313.0 = docCountn            0.015219777 = queryNormn          3.604646 = fieldWeight in 305456, product of:n            1.0 = tf(freq=1.0), with freq of:n              1.0 = termFreq=1.0n            7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:n              635.0 = docFreqn              316313.0 = docCountn            0.5 = fieldNorm(doc=305456)n      0.26367697 = weight(title_de:samsun in 305456) [ClassicSimilarity], result of:n        0.26367697 = score(doc=305456,freq=1.0), product of:n          0.073149204 = queryWeight, product of:n            0.6666666 = boostn            7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:n              635.0 = docFreqn              316313.0 = docCountn            0.015219777 = queryNormn          3.604646 = fieldWeight in 305456, product of:n            1.0 = tf(freq=1.0), with freq of:n              1.0 = termFreq=1.0n            7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:n              635.0 = docFreqn              316313.0 = docCountn            0.5 = fieldNorm(doc=305456)n      0.33901328 = weight(title_de:samsung in 305456) [ClassicSimilarity], result of:n        0.33901328 = score(doc=305456,freq=1.0), product of:n          0.094048984 = queryWeight, product of:n            0.85714287 = boostn            7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:n              635.0 = docFreqn              316313.0 = docCountn            0.015219777 = queryNormn          3.604646 = fieldWeight in 305456, product of:n            1.0 = tf(freq=1.0), with freq of:n              1.0 = termFreq=1.0n            7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:n              635.0 = docFreqn              316313.0 = docCountn            0.5 = fieldNorm(doc=305456)n    0.23877257 = sum of:n      0.23877257 = weight(title_de:magic in 305456) [ClassicSimilarity], result of:n        0.23877257 = score(doc=305456,freq=1.0), product of:n          0.0762529 = queryWeight, product of:n            0.8 = boostn            6.262649 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:n              1638.0 = docFreqn              316313.0 = docCountn            0.015219777 = queryNormn          3.1313245 = fieldWeight in 305456, product of:n            1.0 = tf(freq=1.0), with freq of:n              1.0 = termFreq=1.0n            6.262649 = idf, 
computed as log((docCount+1)/(docFreq+1)) + 1 from:n              1638.0 = docFreqn              316313.0 = docCountn            0.5 = fieldNorm(doc=305456)n",
},
"QParser":"ExtendedDismaxQParser"
}

This behavior would satisfy me, except that I don't actually have a real product named "Somsung majic". That is a theoretical case, but in practice this fuzzy operator leads to many other incorrect search results in exactly the same way.

So, to handle such cases, my idea was the one I described at the beginning: also add the exact terms (without the fuzzy modifier) with a boost factor. The question now is how to implement this properly. I found that the following query works acceptably if I reduce the mm parameter:

"q":"somsung~2 majic~2 somsung^3 majic^3"

That is because I added more words to the query, so "minimum should match" also has to be reduced. The problem is that with a reduced "mm" I get bad results on long titles that do contain the exact terms (some wrong items can be ranked higher due to other factors). Here is the debug for it:

"debug":{
"rawquerystring":"somsung~2 majic~2 somsung^3 majic^3",
"querystring":"somsung~2 majic~2 somsung^3 majic^3",
"parsedquery":"(+(DisjunctionMaxQuery((title_de:somsung~2)) DisjunctionMaxQuery((title_de:majic~2)) DisjunctionMaxQuery((title_de:somsung))^3.0 DisjunctionMaxQuery((title_de:majic))^3.0)~2 DisjunctionMaxQuery((title_de:"somsung 2 majic 2 somsung 3 majic 3")))/no_coord",
"parsedquery_toString":"+(((title_de:somsung~2) (title_de:majic~2) ((title_de:somsung))^3.0 ((title_de:majic))^3.0)~2) (title_de:"somsung 2 majic 2 somsung 3 majic 3")",
"explain":{
"69019":"n0.3418829 = sum of:n  0.3418829 = product of:n    0.6837658 = sum of:n      0.5621489 = sum of:n        0.13430178 = weight(title_de:amsung in 305456) [ClassicSimilarity], result of:n          0.13430178 = score(doc=305456,freq=1.0), product of:n            0.037257966 = queryWeight, product of:n              0.6666666 = boostn              7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:n                635.0 = docFreqn                316313.0 = docCountn              0.0077520725 = queryNormn            3.604646 = fieldWeight in 305456, product of:n              1.0 = tf(freq=1.0), with freq of:n                1.0 = termFreq=1.0n              7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:n                635.0 = docFreqn                316313.0 = docCountn              0.5 = fieldNorm(doc=305456)n        0.12087161 = weight(title_de:msung in 305456) [ClassicSimilarity], result of:n          0.12087161 = score(doc=305456,freq=1.0), product of:n            0.033532172 = queryWeight, product of:n              0.6 = boostn              7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:n                635.0 = docFreqn                316313.0 = docCountn              0.0077520725 = queryNormn            3.604646 = fieldWeight in 305456, product of:n              1.0 = tf(freq=1.0), with freq of:n                1.0 = termFreq=1.0n              7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:n                635.0 = docFreqn                316313.0 = docCountn              0.5 = fieldNorm(doc=305456)n        0.13430178 = weight(title_de:samsun in 305456) [ClassicSimilarity], result of:n          0.13430178 = score(doc=305456,freq=1.0), product of:n            0.037257966 = queryWeight, product of:n              0.6666666 = boostn              7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:n                635.0 = docFreqn                316313.0 = docCountn              0.0077520725 = queryNormn            3.604646 = fieldWeight in 305456, product of:n              1.0 = tf(freq=1.0), with freq of:n                1.0 = termFreq=1.0n              7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:n                635.0 = docFreqn                316313.0 = docCountn              0.5 = fieldNorm(doc=305456)n        0.17267373 = weight(title_de:samsung in 305456) [ClassicSimilarity], result of:n          0.17267373 = score(doc=305456,freq=1.0), product of:n            0.047903106 = queryWeight, product of:n              0.85714287 = boostn              7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:n                635.0 = docFreqn                316313.0 = docCountn              0.0077520725 = queryNormn            3.604646 = fieldWeight in 305456, product of:n              1.0 = tf(freq=1.0), with freq of:n                1.0 = termFreq=1.0n              7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:n                635.0 = docFreqn                316313.0 = docCountn              0.5 = fieldNorm(doc=305456)n      0.12161691 = sum of:n        0.12161691 = weight(title_de:magic in 305456) [ClassicSimilarity], result of:n          0.12161691 = score(doc=305456,freq=1.0), product of:n            0.038838807 = queryWeight, product of:n              0.8 = boostn              6.262649 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:n                1638.0 = docFreqn                316313.0 = docCountn              0.0077520725 = 
queryNormn            3.1313245 = fieldWeight in 305456, product of:n              1.0 = tf(freq=1.0), with freq of:n                1.0 = termFreq=1.0n              6.262649 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:n                1638.0 = docFreqn                316313.0 = docCountn              0.5 = fieldNorm(doc=305456)n    0.5 = coord(2/4)n"
},
"QParser":"ExtendedDismaxQParser"
}

The following query works even with a large "mm" parameter (e.g. 90%):

"q":"(somsung~2 majic~2) OR (somsung^3 majic^3)"

But here the problem is that I get 430 results (instead of the desired 6). Here is the debug with an example of a wrong product (note how somsung~2 fuzzy-matched the indexed ngram terms losung and osung):

"debug":{
"rawquerystring":"(somsung~2 majic~2) OR (somsung^3 majic^3)",
"querystring":"(somsung~2 majic~2) OR (somsung^3 majic^3)",
"parsedquery":"(+((DisjunctionMaxQuery((title_de:somsung~2)) DisjunctionMaxQuery((title_de:majic~2))) (DisjunctionMaxQuery((title_de:somsung))^3.0 DisjunctionMaxQuery((title_de:majic))^3.0))~1 DisjunctionMaxQuery((title_de:"somsung 2 majic 2 somsung 3 majic 3")))/no_coord",
"parsedquery_toString":"+((((title_de:somsung~2) (title_de:majic~2)) (((title_de:somsung))^3.0 ((title_de:majic))^3.0))~1) (title_de:"somsung 2 majic 2 somsung 3 majic 3")",
"explain":{
"113746":"n0.1275867 = sum of:n  0.1275867 = product of:n    0.2551734 = sum of:n      0.2551734 = product of:n        0.5103468 = sum of:n          0.5103468 = sum of:n            0.26860356 = weight(title_de:losung in 296822) [ClassicSimilarity], result of:n              0.26860356 = score(doc=296822,freq=1.0), product of:n                0.037257966 = queryWeight, product of:n                  0.6666666 = boostn                  7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:n                    635.0 = docFreqn                    316313.0 = docCountn                  0.0077520725 = queryNormn                7.209292 = fieldWeight in 296822, product of:n                  1.0 = tf(freq=1.0), with freq of:n                    1.0 = termFreq=1.0n                  7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:n                    635.0 = docFreqn                    316313.0 = docCountn                  1.0 = fieldNorm(doc=296822)n            0.24174322 = weight(title_de:osung in 296822) [ClassicSimilarity], result of:n              0.24174322 = score(doc=296822,freq=1.0), product of:n                0.033532172 = queryWeight, product of:n                  0.6 = boostn                  7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:n                    635.0 = docFreqn                    316313.0 = docCountn                  0.0077520725 = queryNormn                7.209292 = fieldWeight in 296822, product of:n                  1.0 = tf(freq=1.0), with freq of:n                    1.0 = termFreq=1.0n                  7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:n                    635.0 = docFreqn                    316313.0 = docCountn                  1.0 = fieldNorm(doc=296822)n        0.5 = coord(1/2)n    0.5 = coord(1/2)n"
},
"QParser":"ExtendedDismaxQParser"
}

So, although I am getting better results now, I still need to improve the search, and I still don't understand which approach to choose or why I get these results.
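
For what it's worth, here is a minimal sketch of how I build the combined query string from the user input (the helper name is mine, not an established API): one fuzzy clause plus one boosted exact clause per term, mirroring the "somsung~2 majic~2 somsung^3 majic^3" variant above:

# Hypothetical helper: one fuzzy clause plus one boosted exact clause per term.
# Note that this doubles the clause count, so mm has to be relaxed accordingly.
def build_typo_tolerant_query(user_input, max_edits=2, exact_boost=3):
    terms = user_input.split()
    fuzzy = " ".join(f"{t}~{max_edits}" for t in terms)
    exact = " ".join(f"{t}^{exact_boost}" for t in terms)
    return f"{fuzzy} {exact}"

print(build_typo_tolerant_query("somsung majic"))
# -> somsung~2 majic~2 somsung^3 majic^3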

I don't think edismax supports the fuzzy ~ operator. There is a patch with a long history that developers have been running in production for quite a while, but it has never made it into the Solr codebase.

edismax does work with fuzzy, but when you include mm=90 you are essentially telling Solr that 90% of the terms must match exactly. That is too high!

Removing it, or using a lower percentage such as 50%, will allow some fuzziness to come into play.
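
For example, the same debug request as above with a relaxed mm (the 50% value is only an illustration):

"params":{
  "mm":"50%",
  "q":"somsung~2 majic~2",
  "defType":"edismax",
  "qf":"title_de",
  "pf":"title_de",
  "fl":"*,score",
  "wt":"json"
}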
