I am using

singleTFIDF = TfidfVectorizer(
    analyzer='char_wb',
    ngram_range=(4, 6),
    stop_words=my_stop_words,
    max_features=50
).fit([text])

and I would like to know why there is whitespace in my features, e.g. " friction". How can I avoid this? Do I need to tokenize and preprocess the text myself?
Use analyzer='word'.

When we use analyzer='char_wb', the vectorizer pads with whitespace because it does not tokenize the text into words; it tokenizes it into characters.
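The padding behavior can be sketched in pure Python (a rough approximation of the char_wb logic, not scikit-learn's exact implementation; the helper name char_wb_ngrams is made up here):

```python
import re

def char_wb_ngrams(text, min_n=4, max_n=6):
    """Approximate analyzer='char_wb': pad each word with one space
    on both sides, then emit character n-grams per word."""
    text = re.sub(r"\s+", " ", text.lower())
    ngrams = []
    for word in text.split():
        padded = " " + word + " "
        for n in range(min_n, max_n + 1):
            if len(padded) < n:
                break  # word (plus padding) is too short for this n
            for i in range(len(padded) - n + 1):
                ngrams.append(padded[i:i + n])
    return ngrams

print(char_wb_ngrams("This is"))
# [' thi', 'this', 'his ', ' this', 'this ', ' this ', ' is ']
```

Every n-gram stays inside a single padded word, so n-grams that span two words (e.g. 's is') never appear.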
From the documentation of the analyzer argument:

analyzer : {'word', 'char', 'char_wb'} or callable, default='word'
    Whether the feature should be made of word or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

See the following example:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer(
    analyzer='char_wb',
    ngram_range=(4, 6))
X = vectorizer.fit_transform(corpus)
# on scikit-learn < 1.0, use get_feature_names() instead
print([(len(w), w) for w in vectorizer.get_feature_names_out()])
[(4, ' and'), (5, ' and '), (4, ' doc'), (5, ' docu'), (6, ' docum'), (4, ' fir'), (5, ' firs'), (6, ' first'), (4, ' is '), (4, ' one'), (5, ' one.'), (6, ' one. '), (4, ' sec'), (5, ' seco'), (6, ' secon'), (4, ' the'), (5, ' the '), (4, ' thi'), (5, ' thir'), (6, ' third'), (5, ' this'), (6, ' this '), (4, 'and '), (4, 'cond'), (5, 'cond '), (4, 'cume'), (5, 'cumen'), (6, 'cument'), (4, 'docu'), (5, 'docum'), (6, 'docume'), (4, 'econ'), (5, 'econd'), (6, 'econd '), (4, 'ent '), (4, 'ent.'), (5, 'ent. '), (4, 'ent?'), (5, 'ent? '), (4, 'firs'), (5, 'first'), (6, 'first '), (4, 'hird'), (5, 'hird '), (4, 'his '), (4, 'ird '), (4, 'irst'), (5, 'irst '), (4, 'ment'), (5, 'ment '), (5, 'ment.'), (6, 'ment. '), (5, 'ment?'), (6, 'ment? '), (4, 'ne. '), (4, 'nt. '), (4, 'nt? '), (4, 'ocum'), (5, 'ocume'), (6, 'ocumen'), (4, 'ond '), (4, 'one.'), (5, 'one. '), (4, 'rst '), (4, 'seco'), (5, 'secon'), (6, 'second'), (4, 'the '), (4, 'thir'), (5, 'third'), (6, 'third '), (4, 'this'), (5, 'this '), (4, 'umen'), (5, 'ument'), (6, 'ument '), (6, 'ument.'), (6, 'ument?')]
Note that:

- the output/features include ' this' (with an extra leading space that does not appear in the original text; the sentences start with 'This')
- the output/features include 'ment. ' (with an extra trailing space that does not appear in the original text; the sentences end with 'document.')
- the output/features do not include 'is the', because that n-gram crosses a word boundary, and the 'char_wb' analyzer only creates n-grams inside word boundaries