TfidfVectorizer has whitespace in feature words with char_wb



I use

singleTFIDF = TfidfVectorizer(
    analyzer='char_wb',
    ngram_range=(4, 6),
    stop_words=my_stop_words,
    max_features=50
).fit([text])

and wonder why there is whitespace in my features.

How can I avoid this? Do I need to tokenize and preprocess the text myself?

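Use analyzer='word'

With analyzer='char_wb', the vectorizer pads with whitespace because it does not tokenize the text into words; it tokenizes it into characters.

According to the documentation for the analyzer parameter: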

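analyzer{'word', 'char', 'char_wb'} or callable, default='word'

Whether the feature should be made of word or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.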
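The padding rule quoted above can be illustrated with a small plain-Python sketch. The helper name char_wb_ngrams is hypothetical, and this is a simplified approximation of the behaviour described in the documentation, not scikit-learn's actual code: each whitespace-separated word is wrapped in one space on each side, and the n-grams are then cut from the padded word only.

```python
def char_wb_ngrams(text, min_n=4, max_n=6):
    """Approximate 'char_wb' n-grams (hypothetical helper, for illustration only)."""
    ngrams = []
    for word in text.lower().split():
        padded = " " + word + " "  # one padding space on each edge of the word
        for n in range(min_n, max_n + 1):
            if len(padded) <= n:
                # The padded word fits in a single n-gram: keep it once and stop
                ngrams.append(padded)
                break
            # Slide a window of width n over the padded word only,
            # so no n-gram ever spans two words
            for offset in range(len(padded) - n + 1):
                ngrams.append(padded[offset:offset + n])
    return ngrams


print(char_wb_ngrams("This is"))
# [' thi', 'this', 'his ', ' this', 'this ', ' this ', ' is ']
```

Every n-gram that touches the start or end of a word carries one of the padding spaces, which is where the whitespace in the feature names comes from.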

See the following example:

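from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
# Character n-grams of length 4 to 6, built inside word boundaries
vectorizer = TfidfVectorizer(
    analyzer='char_wb',
    ngram_range=(4, 6))
X = vectorizer.fit_transform(corpus)
# Print each feature with its length so that padding spaces are visible.
# Note: get_feature_names() was removed in scikit-learn 1.2;
# newer versions provide get_feature_names_out() instead.
print([(len(w), w) for w in vectorizer.get_feature_names()])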

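[(4, ' and'), (5, ' and '), (4, ' doc'), (5, ' docu'), (6, ' docum'), (4, ' fir'), (5, ' firs'), ..., (5, ' one.'), (6, ' one. '), ..., (4, ' the'), (5, ' the '), (4, ' thi'), ..., (5, ' this'), (6, ' this '), (4, 'and '), ..., (4, 'cume'), (5, 'cumen'), ..., (6, 'docume'), (4, 'econ'), ..., (4, 'ent.'), (5, 'ent. '), ..., (5, 'first'), (6, 'first '), (4, 'hird'), ..., (4, 'ird '), (4, 'irst'), (5, 'irst '), ..., (5, 'ment.'), (6, 'ment. '), (5, 'ment?'), ..., (4, 'nt. '), (4, 'nt? '), (4, 'ocum'), (5, 'ocume'), ..., (4, 'ond '), (4, 'one.'), ..., (5, 'secon'), (6, 'second'), ..., (5, 'third'), ..., (4, 'this'), ..., (5, 'ument'), (6, 'ument ')]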

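Note:

  • The output/features include ' this' (with an extra leading space that is not in the original text; the sentence starts with 'This')
  • The output/features include 'ment. ' (with an extra trailing space that is not in the original text; the sentence ends with 'document.')
  • The output/features do not include 'is the', because that n-gram crosses a word boundary, and the 'char_wb' analyzer only creates n-grams inside word boundaries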
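If character n-grams are still wanted but without the padding spaces, one possible approach is to do the tokenizing yourself in a custom function. The sketch below is only an assumption about what such a function could look like; the name unpadded_char_ngrams and its defaults are hypothetical and not part of scikit-learn.

```python
def unpadded_char_ngrams(doc, min_n=4, max_n=6):
    """Character n-grams taken strictly inside each word, with no padding.

    Hypothetical helper: words are split on whitespace only, so punctuation
    stays attached to the word, and words shorter than min_n yield nothing.
    """
    ngrams = []
    for word in doc.lower().split():
        for n in range(min_n, max_n + 1):
            # Window over the bare word, so no n-gram can contain a space
            for offset in range(len(word) - n + 1):
                ngrams.append(word[offset:offset + n])
    return ngrams


print(unpadded_char_ngrams("This is the first"))
# ['this', 'firs', 'irst', 'first']
```

Because the analyzer parameter also accepts a callable, a function like this could be passed as analyzer=unpadded_char_ngrams. In that case the vectorizer's own ngram_range and stop_words settings are not applied, so any such filtering would have to happen inside the function.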
