Why does SQL Server full-text search index the acronyms SCR or SUR together with the number that follows them?



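I've found a very strange behavior in SQL Server full-text search: it indexes SUR, SCR, and possibly some other acronyms together with the number that follows them, as a single "Exact Match" term.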

SELECT * FROM sys.dm_fts_parser ('"SUR 12345"', 1033, 0, 0)
keyword             | group_id | phrase_id | occurrence | special_term | display_term | expansion_type | source_term
s u r 1 2 3 4 5     | 1        | 0         | 1          | Exact Match  | sur12345     | 0              | SUR 12345
n n 1 2 3 4 5 s u r | 1        | 0         | 1          | Exact Match  | nn12345sur   | 0              | SUR 12345

I found your case interesting. After some work and research, I found a few things that may help you.

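The problem is that a term of 1 or 2 characters is treated as a stop word, so the full-text search engine skips it when a user searches for a phrase that contains it. Solution: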

1) Drop all full-text indexes and the full-text catalog.
2) Create a stoplist and, in its settings, choose Action = Delete All Stopwords and Full-Text Language = English.
3) Create the full-text catalog and all full-text indexes through scripts, as given below, instead of the wizard.
4) Attach the stoplist to the full-text indexes when creating them.
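Step 2 can also be scripted rather than done through the Management Studio dialog. The following is a minimal sketch, assuming the stoplist is named AbstractSearchStopList (the name the index scripts below refer to) and that the database compatibility level is 100 or higher. CREATE FULLTEXT STOPLIST without a FROM clause creates an empty stoplist, which should have the same effect as deleting all stop words.

```sql
-- Create an empty stoplist; omitting the FROM clause means it starts with no stop words.
IF NOT EXISTS (SELECT * FROM sys.fulltext_stoplists WHERE name = N'AbstractSearchStopList')
    CREATE FULLTEXT STOPLIST [AbstractSearchStopList];
GO

-- Sanity check: list any stop words in the new stoplist (an empty stoplist should return no rows).
SELECT sw.stopword, sw.language
FROM sys.fulltext_stopwords AS sw
JOIN sys.fulltext_stoplists AS sl
    ON sl.stoplist_id = sw.stoplist_id
WHERE sl.name = N'AbstractSearchStopList';
GO
```

Running this before the catalog and index scripts avoids an error from the WITH STOPLIST clause referring to a stoplist that does not exist yet.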

/*******************Drop and create FULL TEXT CATALOG for AbstractSearch*************************************/
/****** Drop:  FullTextCatalog [abstractSearch]    Script Date: 12/02/2011 13:10:21 ******/
GO
IF  EXISTS (SELECT * FROM sys.fulltext_indexes fti WHERE fti.object_id = OBJECT_ID(N'[dbo].[Abstract]'))
ALTER FULLTEXT INDEX ON [dbo].[Abstract] DISABLE
GO
/****** Object:  FullTextIndex     Script Date: 12/02/2011 13:10:21 ******/
IF  EXISTS (SELECT * FROM sys.fulltext_indexes fti WHERE fti.object_id = OBJECT_ID(N'[dbo].[Abstract]'))
DROP FULLTEXT INDEX ON [dbo].[Abstract]
GO
IF  EXISTS (SELECT * FROM sys.fulltext_indexes fti WHERE fti.object_id = OBJECT_ID(N'[dbo].[Author]'))
ALTER FULLTEXT INDEX ON [dbo].[Author] DISABLE
GO
/****** Object:  FullTextIndex     Script Date: 12/02/2011 13:10:21 ******/
IF  EXISTS (SELECT * FROM sys.fulltext_indexes fti WHERE fti.object_id = OBJECT_ID(N'[dbo].[Author]'))
DROP FULLTEXT INDEX ON [dbo].[Author]
GO
IF  EXISTS (SELECT * FROM sys.fulltext_indexes fti WHERE fti.object_id = OBJECT_ID(N'[dbo].[Synonym]'))
ALTER FULLTEXT INDEX ON [dbo].[Synonym] DISABLE
GO
/****** Object:  FullTextIndex     Script Date: 12/02/2011 13:10:21 ******/
IF  EXISTS (SELECT * FROM sys.fulltext_indexes fti WHERE fti.object_id = OBJECT_ID(N'[dbo].[Synonym]'))
DROP FULLTEXT INDEX ON [dbo].[Synonym]
GO
IF  EXISTS (SELECT * FROM sys.fulltext_catalogs ftc WHERE ftc.name = N'abstractSearch')
DROP FULLTEXT CATALOG [abstractSearch]
GO

/****** Create:  FullTextCatalog [abstractSearch]    Script Date: 12/02/2011 13:10:21 ******/
CREATE FULLTEXT CATALOG [abstractSearch]
AS DEFAULT
GO
/****** Create:  FullTextIndex on  Abstract with stoplist set to custom stoplist   Script Date: 12/02/2011 13:10:21 ******/
CREATE FULLTEXT INDEX ON dbo.Abstract
(abstractTitle, abstractDescription)
KEY INDEX PK_Abstract
ON [abstractSearch]
WITH STOPLIST = [AbstractSearchStopList]
GO

/******  Create:  FullTextIndex on  Synonym with stoplist set to custom stoplist    Script Date: 12/02/2011 13:10:21 ******/
CREATE FULLTEXT INDEX ON dbo.Synonym
(synonyms,keywordSynonym)
KEY INDEX PK_Synonyms
ON [abstractSearch]
WITH STOPLIST = [AbstractSearchStopList]
GO

/******  Create:  FullTextIndex on  Author with stoplist set to custom stoplist   Script Date: 12/02/2011 13:10:21 ******/
CREATE FULLTEXT INDEX ON dbo.Author
(firstName,lastName,middleName)
KEY INDEX PK_Author
ON [abstractSearch]
WITH STOPLIST = [AbstractSearchStopList]
GO
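Once the indexes have been recreated, it may be worth confirming that each one really uses the custom stoplist. This is a sketch of one way to check, using the table names from the script above; in sys.fulltext_indexes a stoplist_id of 0 denotes the system stoplist and NULL denotes no stoplist.

```sql
-- Show which stoplist each recreated full-text index is bound to.
SELECT OBJECT_SCHEMA_NAME(fti.object_id) AS schema_name,
       OBJECT_NAME(fti.object_id)        AS table_name,
       fti.stoplist_id,                   -- 0 = system stoplist, NULL = no stoplist
       sl.name                           AS stoplist_name
FROM sys.fulltext_indexes AS fti
LEFT JOIN sys.fulltext_stoplists AS sl
    ON sl.stoplist_id = fti.stoplist_id
WHERE fti.object_id IN (OBJECT_ID(N'[dbo].[Abstract]'),
                        OBJECT_ID(N'[dbo].[Author]'),
                        OBJECT_ID(N'[dbo].[Synonym]'));
```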

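In the end I was able to determine that the problem is caused by a currency symbol (apparently SUR and SCR are currency codes) followed or preceded by a number, which makes the two get indexed together.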

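In my opinion, this would be desirable behavior only if the user expects past (SUR, the Soviet ruble, not used since 1993) or current (SCR, the Seychelles rupee) currencies to appear in the text, and only if the currency symbol is placed before or after the number according to its convention (for example, $ before the number, SCR or € after the number).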

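In addition, currency symbols appear to partially affect the language-neutral word breaker. Past currencies such as SUR are fine, but a current currency affecting the language-neutral word breaker is completely unexpected, because language-neutral text processing should not be influenced by any dictionary words.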
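The behavior can be compared side by side with sys.dm_fts_parser, which requires membership in the sysadmin role. The sketch below is only an illustration: ABC is a hypothetical acronym chosen as a non-currency control, LCID 1033 selects the English word breaker, and LCID 0 selects the language-neutral one. The output depends on the SQL Server version and the installed word breakers, so no expected results are shown.

```sql
-- Arguments: query string, LCID, stoplist_id (0 = system stoplist), accent sensitivity (0 = insensitive).
SELECT N'SUR, English (1033)' AS test_case, display_term, special_term, source_term
FROM sys.dm_fts_parser(N'"SUR 12345"', 1033, 0, 0)
UNION ALL
SELECT N'SCR, English (1033)', display_term, special_term, source_term
FROM sys.dm_fts_parser(N'"SCR 12345"', 1033, 0, 0)
UNION ALL
SELECT N'SUR, neutral (0)', display_term, special_term, source_term
FROM sys.dm_fts_parser(N'"SUR 12345"', 0, 0, 0)
UNION ALL
SELECT N'SCR, neutral (0)', display_term, special_term, source_term
FROM sys.dm_fts_parser(N'"SCR 12345"', 0, 0, 0)
UNION ALL
-- Hypothetical control: an acronym that is not a currency code.
SELECT N'ABC, English (1033)', display_term, special_term, source_term
FROM sys.dm_fts_parser(N'"ABC 12345"', 1033, 0, 0);
```

If the acronym and the number come back as one display_term for the currency codes but as separate terms for the control, that supports the currency explanation.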

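The Microsoft documentation on full-text search text processing in SQL Server 2012 and later explains the relevant word breaker changes. It shows that the new word breakers do not index the currency symbol and the number separately, even with the language-neutral word breaker: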

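[Table from the Microsoft documentation, columns: Term | Previous word breaker | New word breaker. The rows show how amounts written with "100", "$", and "USD" are tokenized (for example into terms such as nn100 and nn100usd); the individual cell values are garbled in this copy.]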
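To see which word breakers a particular instance is actually using, the registered components can be listed. This is a small sketch; it assumes only the built-in sp_help_fulltext_system_components procedure and sys.fulltext_languages view, and the rows returned will differ between versions and installations.

```sql
-- Server version, to relate the results to the pre-2012 and 2012+ word breakers.
SELECT SERVERPROPERTY('ProductVersion') AS product_version;

-- Registered word breakers, including file path and version.
EXEC sp_help_fulltext_system_components 'wordbreaker';

-- Languages available for full-text indexing, with their LCIDs (0 is the neutral language).
SELECT lcid, name
FROM sys.fulltext_languages
ORDER BY lcid;
```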
