Sklearn: Would like to extend CountVectorizer to fuzzy match against the vocabulary

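I am going to try using FuzzyWuzzy with a tuned minimum-score parameter. Essentially, it would check whether a word is in the vocabulary and, if not, ask FuzzyWuzzy to choose the best fuzzy match, accepting that match for the token only if it reaches at least a certain score.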
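A minimal sketch of that check, under some loud assumptions: it uses `difflib` from the Python standard library as a stand-in for FuzzyWuzzy (so the score is a ratio between 0 and 1 rather than 0 to 100), the function signature and sample tokens are made up for illustration, and dropping tokens that have no good match is only one possible policy.

```python
from difflib import get_close_matches


def fuzzy_repair(tokens, vocabulary, min_score=0.8):
    """Map out-of-vocabulary tokens to their closest vocabulary entry.

    Hypothetical helper: difflib stands in for FuzzyWuzzy here, and tokens
    without a sufficiently close match are dropped (an assumption).
    """
    candidates = sorted(vocabulary)
    repaired = []
    for token in tokens:
        if token in vocabulary:
            # Exact hit: keep the token unchanged.
            repaired.append(token)
            continue
        # n=1 asks for at most one match whose similarity ratio is >= cutoff.
        match = get_close_matches(token, candidates, n=1, cutoff=min_score)
        if match:
            repaired.append(match[0])
    return repaired


vocabulary = {"fuzzy", "matching", "token"}  # made-up vocabulary
repaired = fuzzy_repair(["fuzzy", "matchng", "tokeen", "xyz"], vocabulary)
print(repaired)
```

Each out-of-vocabulary token is compared against every vocabulary entry, so the cost grows with the vocabulary size for every unknown token.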

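If this isn't the best approach for dealing with a fair number of typos and slightly differently spelled but similar words, I am open to suggestions.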

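The problem is that my subclass keeps complaining that it has an empty vocabulary, which makes no sense, because a regular CountVectorizer used in the same part of my code works fine.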

It spits out errors like this: ValueError: empty vocabulary; perhaps the documents only contain stop words

What am I missing? I don't have it doing anything special yet. It should work like the normal one:

import numpy
from sklearn.feature_extraction.text import CountVectorizer


class FuzzyCountVectorizer(CountVectorizer):
    def __init__(self, input='content', encoding='utf-8', decode_error='strict',
                 strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None,
                 token_pattern="(?u)\b\w\w+\b", ngram_range=(1, 1), analyzer='word',
                 max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False,
                 dtype=numpy.int64, min_fuzzy_score=80):
        super().__init__(
            input=input, encoding=encoding, decode_error=decode_error, strip_accents=strip_accents,
            lowercase=lowercase, preprocessor=preprocessor, tokenizer=tokenizer, stop_words=stop_words,
            token_pattern=token_pattern, ngram_range=ngram_range, analyzer=analyzer, max_df=max_df,
            min_df=min_df, max_features=max_features, vocabulary=vocabulary, binary=binary, dtype=dtype)
        # self._trained = False
        self.min_fuzzy_score = min_fuzzy_score

    @staticmethod
    def remove_non_alphanumeric_chars(s: str) -> 'str':
        pass

    @staticmethod
    def tokenize_text(s: str) -> 'List[str]':
        pass

    def fuzzy_repair(self, sl: 'List[str]') -> 'List[str]':
        pass

    def fit(self, raw_documents, y=None):
        print('Running FuzzyTokenizer Fit')
        # TODO clean up input
        super().fit(raw_documents=raw_documents, y=y)
        self._trained = True
        return self

    def transform(self, raw_documents):
        print('Running Transform')
        # TODO clean up input
        # TODO fuzzyrepair
        return super().transform(raw_documents=raw_documents)

The original definition in scikit-learn's CountVectorizer has

token_pattern=r"(?u)\b\w\w+\b"
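Note the `r` prefix on that literal. What it changes can be checked with the standard `re` module alone; the sketch below uses a made-up sample sentence and does not involve scikit-learn at all.

```python
import re

# In a regular string literal "\b" is the single backspace character (0x08);
# in a raw string it stays backslash + "b", which the regex engine reads as a
# word boundary.
print(len("\b"), len(r"\b"))

raw_pattern = r"(?u)\b\w\w+\b"
# What the non-raw literal "(?u)\b\w\w+\b" evaluates to. "\\w" is spelled out
# here only to avoid Python's invalid-escape warning.
non_raw_pattern = "(?u)\b\\w\\w+\b"

text = "fuzzy matching of tokens"  # made-up sample text
raw_tokens = re.findall(raw_pattern, text)
non_raw_tokens = re.findall(non_raw_pattern, text)

print(raw_tokens)      # ['fuzzy', 'matching', 'of', 'tokens']
print(non_raw_tokens)  # []
```

The non-raw pattern only matches words surrounded by literal backspace characters, so ordinary text yields no tokens.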

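In your subclass you don't use the r (raw string) prefix, hence this issue: in a regular string literal, \b is the backspace character rather than a regex word boundary, so the pattern matches no tokens and the vocabulary ends up empty. Also, rather than copying all the __init__ arguments, it might be easier to use,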

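def __init__(self, *args, **kwargs):
    # Pop the extra option before delegating; CountVectorizer does not accept it.
    # Caveat: scikit-learn's get_params()/clone() inspect the __init__ signature and
    # reject *args, so this shortcut may not work where the estimator gets cloned.
    self.min_fuzzy_score = kwargs.pop('min_fuzzy_score', 80)
    super().__init__(*args, **kwargs)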

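Whether this is the best approach depends on the size of your dataset. For a document set with N_words words in total and a vocabulary of size N_vocab_size, this approach requires O(N_words*N_vocab_size) fuzzy word comparisons. If you instead vectorize your dataset with a standard CountVectorizer and then reduce the computed vocabulary (and the bag-of-words matrix) through fuzzy matching, it would require "only" O(N_vocab_size**2) comparisons.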
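The second variant, reducing an already computed vocabulary, might look roughly like the sketch below. It is only an illustration under assumptions: `difflib.SequenceMatcher` from the standard library stands in for FuzzyWuzzy, the helper name `merge_similar_terms` and the term counts are invented, and folding rare spellings into more frequent ones is one heuristic among several.

```python
from collections import Counter
from difflib import SequenceMatcher


def merge_similar_terms(counts, min_ratio=0.8):
    """Fold each term into the most similar, more frequent term (hypothetical helper)."""
    representatives = []  # terms kept as canonical spellings
    mapping = {}          # term -> canonical spelling
    # Visit frequent terms first so rare misspellings map onto common spellings.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for term, _ in ordered:
        best, best_ratio = None, min_ratio
        # Comparing each term with the kept terms is what gives O(N_vocab_size**2).
        for rep in representatives:
            ratio = SequenceMatcher(None, term, rep).ratio()
            if ratio >= best_ratio:
                best, best_ratio = rep, ratio
        if best is None:
            representatives.append(term)
            best = term
        mapping[term] = best
    merged = Counter()
    for term, n in counts.items():
        merged[mapping[term]] += n
    return mapping, merged


# Made-up term counts, as they might come out of a plain word count.
counts = Counter({"vectorizer": 5, "vectorzier": 1, "vocabulary": 4,
                  "vocabulery": 1, "token": 3})
mapping, merged = merge_similar_terms(counts)
print(mapping)
print(merged)
```

With a real bag-of-words matrix, the same mapping would be used to sum the columns of merged terms; that step is left out here.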

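This would probably still not work well for a vocabulary larger than a few tens of thousands of words. If you intend to apply some machine learning algorithm to the resulting sparse array, you might also want to try character n-grams, which are also somewhat robust to typographical errors.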
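Why character n-grams tolerate typos can be illustrated without scikit-learn. The sketch below uses hypothetical helpers and made-up words to compare the n-gram sets of a word, a misspelling of it, and an unrelated word. In scikit-learn itself, character n-grams are selected with `analyzer='char'` or `analyzer='char_wb'` together with `ngram_range`.

```python
def char_ngrams(word, n_min=2, n_max=3):
    """Return the set of character n-grams of `word` for n_min <= n <= n_max."""
    return {
        word[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(word) - n + 1)
    }


def jaccard(a, b):
    """Share of n-grams that two non-empty sets have in common."""
    return len(a & b) / len(a | b)


correct = char_ngrams("vocabulary")
typo = char_ngrams("vocabulery")   # one wrong letter
other = char_ngrams("tokenizer")   # unrelated word

typo_overlap = jaccard(correct, typo)
other_overlap = jaccard(correct, other)
print(typo_overlap, other_overlap)
```

A single wrong letter only disturbs the few n-grams that contain it, so the misspelled word still shares most of its features with the correct spelling, whereas as whole-word tokens the two would have nothing in common.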
