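Passing tokens to CountVectorizer

I have a text classification problem with two types of features:

n-gram features (extracted by CountVectorizer)
Other textual features (for example, the presence of a word from a given lexicon). These features differ from n-grams because they should be part of any n-gram extracted from the text.

Both types of features are extracted from the text's tokens. I want to run tokenization only once and then pass the resulting tokens both to CountVectorizer and to the other presence-feature extractors. So I would like to pass a list of tokens to CountVectorizer, but it only accepts a string as the representation of a sample. Is there a way to pass an array of tokens?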

Summarizing the answers of @user126350 and @miroli, as well as this link:

from sklearn.feature_extraction.text import CountVectorizer
def dummy(doc):
    return doc
cv = CountVectorizer(
        tokenizer=dummy,
        preprocessor=dummy,
    )  
docs = [
    ['hello', 'world', '.'],
    ['hello', 'world'],
    ['again', 'hello', 'world']
]
cv.fit(docs)
cv.get_feature_names_out()  # scikit-learn >= 1.0; older releases use get_feature_names()
# array(['.', 'again', 'hello', 'world'], dtype=object)

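One thing to keep in mind: wrap a new tokenized document in a list before calling transform(), so that it is handled as a single document instead of each token being interpreted as a document of its own: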

new_doc = ['again', 'hello', 'world', '.']
v_1 = cv.transform(new_doc)
v_2 = cv.transform([new_doc])
v_1.shape
# (4, 4)
v_2.shape
# (1, 4)
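To tie this back to the question (tokenize once, then build both kinds of features), here is a minimal sketch. The lexicon, the `identity` helper and the variable names are my own, made up for illustration, and `ngram_range=(1, 2)` is only there to show that n-grams are still built from the tokens you pass in. Setting `token_pattern=None` is optional; as far as I can tell it just silences the warning about the unused pattern.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer


def identity(tokens):
    # The documents are already tokenized, so return them unchanged.
    return tokens


# Hypothetical lexicon and pre-tokenized documents, for illustration only.
lexicon = {"hello", "again"}
docs = [
    ["hello", "world", "."],
    ["hello", "world"],
    ["again", "hello", "world"],
]

cv = CountVectorizer(
    tokenizer=identity,
    preprocessor=identity,
    token_pattern=None,   # not used when a tokenizer is supplied
    lowercase=False,
    ngram_range=(1, 2),   # unigrams and bigrams built from the given tokens
)
ngram_counts = cv.fit_transform(docs)

# One binary column per lexicon word: 1 if the word occurs in the document.
lexicon_words = sorted(lexicon)
presence = np.array(
    [[int(word in doc) for word in lexicon_words] for doc in docs]
)

# Put the n-gram counts and the presence features side by side.
features = hstack([ngram_counts, csr_matrix(presence)]).tocsr()
```

Both feature blocks are computed from the same token lists, so the tokenizer itself runs only once, before any of this code.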

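In general, you can pass a custom tokenizer parameter to CountVectorizer. The tokenizer should be a function that takes a string and returns an array of its tokens. However, if you already have your tokens in arrays, you can simply build a dictionary of the token arrays under some arbitrary keys and have the tokenizer look them up in that dictionary. Then, when you run CountVectorizer, just pass it the dictionary keys. For example: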

from sklearn.feature_extraction.text import CountVectorizer

# arbitrary token arrays and their keys
custom_tokens = {"hello world": ["here", "is", "world"],
                 "it is possible": ["yes it", "is"]}
CV = CountVectorizer(
    # so we can pass it strings
    input='content',
    # turn off preprocessing of strings to avoid corrupting our keys
    lowercase=False,
    preprocessor=lambda x: x,
    # use our token dictionary
    tokenizer=lambda key: custom_tokens[key])
CV.fit(custom_tokens.keys())
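For what it's worth, here is a small runnable sketch of the same dictionary trick that also inspects the result. The variable names (`keys`, `counts`, `vocab`) are mine, and `token_pattern=None` is an optional extra that, as far as I know, only suppresses the unused-pattern warning.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same arbitrary keys and token arrays as in the answer above.
custom_tokens = {
    "hello world": ["here", "is", "world"],
    "it is possible": ["yes it", "is"],
}
keys = list(custom_tokens)

cv = CountVectorizer(
    lowercase=False,                           # keep the keys intact
    preprocessor=lambda key: key,              # no string preprocessing
    tokenizer=lambda key: custom_tokens[key],  # look the tokens up by key
    token_pattern=None,                        # unused with a tokenizer
)
counts = cv.fit_transform(keys)   # one row per key, in the order of `keys`
vocab = sorted(cv.vocabulary_)    # the token strings, multi-word ones included
```

Note the catch with this approach: the tokenizer can only resolve keys that are already in the dictionary, so any new document has to be added to `custom_tokens` before calling `transform`, otherwise the lookup raises a `KeyError`.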

Similar to user126350's answer, but even simpler. Here is what I did:

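from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

def do_nothing(tokens):
    return tokens

# MyCustomTokenizer is my own transformer (not shown here); it turns each
# document string into a list of tokens.
pipe = Pipeline([
    ('tokenizer', MyCustomTokenizer()),
    ('vect', CountVectorizer(tokenizer=do_nothing,
                             preprocessor=None,
                             lowercase=False))
])
# The pipeline has to be fitted before transform() can be called on its own.
doc_vects = pipe.fit_transform(my_docs)  # pass list of documents as strings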
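Since `MyCustomTokenizer` is not shown, here is a guess at what such a pipeline step could look like. `WhitespaceTokenizer` is a hypothetical stand-in that I made up (it just splits on whitespace), and `my_docs` is sample data; swap in whatever tokenizer you actually use.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline


class WhitespaceTokenizer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for MyCustomTokenizer: splits on whitespace."""

    def fit(self, X, y=None):
        # Nothing to learn; returning self keeps the step Pipeline-compatible.
        return self

    def transform(self, X):
        # Turn every document string into a list of tokens.
        return [doc.split() for doc in X]


def do_nothing(tokens):
    return tokens


pipe = Pipeline([
    ("tokenizer", WhitespaceTokenizer()),
    ("vect", CountVectorizer(tokenizer=do_nothing,
                             preprocessor=None,
                             token_pattern=None,  # unused with a tokenizer
                             lowercase=False)),
])

my_docs = ["hello world .", "again hello world"]  # sample data
doc_vects = pipe.fit_transform(my_docs)
```

`lowercase=False` matters here: with `preprocessor=None` the default preprocessing would otherwise try to call `.lower()` on a list of tokens.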
