I am using

singleTFIDF = TfidfVectorizer(
    analyzer='char_wb',
    ngram_range=(4, 6),
    stop_words=my_stop_words,
    max_features=50
).fit([text])

and I would like to know why there is whitespace in my features, e.g. " friction". How can I avoid this? Do I need to tokenize and preprocess the text myself?
Use analyzer='word'.

When we use analyzer='char_wb', the vectorizer pads with whitespace because it does not tokenize the text into words; it tokenizes it into characters.
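The padding behavior can be sketched in pure Python (a rough approximation of the char_wb logic, not scikit-learn's exact implementation; the helper name char_wb_ngrams is made up here):

```python
import re

def char_wb_ngrams(text, min_n=4, max_n=6):
    """Approximate analyzer='char_wb': pad each word with one space
    on both sides, then emit character n-grams per word."""
    text = re.sub(r"\s+", " ", text.lower())
    ngrams = []
    for word in text.split():
        padded = " " + word + " "
        for n in range(min_n, max_n + 1):
            if len(padded) < n:
                break  # word (plus padding) is too short for this n
            for i in range(len(padded) - n + 1):
                ngrams.append(padded[i:i + n])
    return ngrams

print(char_wb_ngrams("This is"))
# [' thi', 'this', 'his ', ' this', 'this ', ' this ', ' is ']
```

Every n-gram stays inside a single padded word, so n-grams that span two words (e.g. 's is') never appear.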
From the documentation of the analyzer argument:

analyzer : {'word', 'char', 'char_wb'} or callable, default='word'
    Whether the feature should be made of word or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

See the following example:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer(
    analyzer='char_wb',
    ngram_range=(4, 6))
X = vectorizer.fit_transform(corpus)
# on scikit-learn < 1.0, use get_feature_names() instead
print([(len(w), w) for w in vectorizer.get_feature_names_out()])
[(4, ' and'), (5, ' and '), (4, ' doc'), (5, ' docu'), (6, ' docum'), (4, ' fir'), (5, ' firs'), (6, ' first'), (4, ' is '), (4, ' one'), (5, ' one.'), (6, ' one. '), (4, ' sec'), (5, ' seco'), (6, ' secon'), (4, ' the'), (5, ' the '), (4, ' thi'), (5, ' thir'), (6, ' third'), (5, ' this'), (6, ' this '), (4, 'and '), (4, 'cond'), (5, 'cond '), (4, 'cume'), (5, 'cumen'), (6, 'cument'), (4, 'docu'), (5, 'docum'), (6, 'docume'), (4, 'econ'), (5, 'econd'), (6, 'econd '), (4, 'ent '), (4, 'ent.'), (5, 'ent. '), (4, 'ent?'), (5, 'ent? '), (4, 'firs'), (5, 'first'), (6, 'first '), (4, 'hird'), (5, 'hird '), (4, 'his '), (4, 'ird '), (4, 'irst'), (5, 'irst '), (4, 'ment'), (5, 'ment '), (5, 'ment.'), (6, 'ment. '), (5, 'ment?'), (6, 'ment? '), (4, 'ne. '), (4, 'nt. '), (4, 'nt? '), (4, 'ocum'), (5, 'ocume'), (6, 'ocumen'), (4, 'ond '), (4, 'one.'), (5, 'one. '), (4, 'rst '), (4, 'seco'), (5, 'secon'), (6, 'second'), (4, 'the '), (4, 'thir'), (5, 'third'), (6, 'third '), (4, 'this'), (5, 'this '), (4, 'umen'), (5, 'ument'), (6, 'ument '), (6, 'ument.'), (6, 'ument?')]
Note that:

- the output/features include ' this' (with an extra leading space that does not appear in the original text; the sentences start with 'This')
- the output/features include 'ment. ' (with an extra trailing space that does not appear in the original text; the sentences end with 'document.')
- the output/features do not include 'is the', because that n-gram crosses a word boundary, and the 'char_wb' analyzer only creates n-grams inside word boundaries