How should I vectorize the following list of lists with scikit-learn?

I want to vectorize a list of lists with scikit-learn. I go to the path where my training texts are stored, read them, and end up with something like this:

corpus = [["this is spam, 'SPAM'"],["this is ham, 'HAM'"],["this is nothing, 'NOTHING'"]]
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(analyzer='word')
vect_representation= vect.fit_transform(corpus)
print(vect_representation.toarray())

I get the following:

return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'
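
The traceback points at the cause: CountVectorizer expects an iterable of strings, but each element of corpus is a one-item list, so the preprocessor ends up calling .lower() on a list. A minimal sketch of one way around it, assuming every inner list holds exactly one string (the name flat_corpus is made up for illustration, and the labels are still embedded in the text at this point):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [["this is spam, 'SPAM'"], ["this is ham, 'HAM'"], ["this is nothing, 'NOTHING'"]]

# Flatten the list of lists into a plain list of strings
# (assumes exactly one string per inner list).
flat_corpus = [doc[0] for doc in corpus]

vect = CountVectorizer(analyzer='word')
vect_representation = vect.fit_transform(flat_corpus)

# Columns follow the alphabetically sorted vocabulary.
print(sorted(vect.vocabulary_))
print(vect_representation.toarray())
```

Because the label is still part of each string, a word such as "spam" is counted twice in its row.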

Another problem is the label at the end of each document. How should I handle the labels so that classification works correctly?

For anyone who finds this in the future, this is what solved my problem:

corpus = [["this is spam, 'SPAM'"],["this is ham, 'HAM'"],["this is nothing, 'NOTHING'"]]
from sklearn.feature_extraction.text import CountVectorizer
bag_of_words = CountVectorizer(tokenizer=lambda doc: doc, lowercase=False).fit_transform(splited_labels_from_corpus)

This is the output when I call .toarray():

[[0 0 1]
 [1 0 0]
 [0 1 0]]
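
The snippet does not show how splited_labels_from_corpus was built. One reading that reproduces the printed matrix is that it holds only the label of each document, as a one-token list. Under that assumption (the parsing line below is a guess, not the poster's code), the result can be reproduced roughly like this:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [["this is spam, 'SPAM'"], ["this is ham, 'HAM'"], ["this is nothing, 'NOTHING'"]]

# Assumption: the label is the quoted word after the last comma of each document.
splited_labels_from_corpus = [
    [doc[0].rsplit(",", 1)[1].strip().strip("'")] for doc in corpus
]

# Each document is already a list of tokens, so the tokenizer returns it unchanged;
# lowercase=False keeps the preprocessor from calling .lower() on a list.
vectorizer = CountVectorizer(tokenizer=lambda doc: doc, lowercase=False)
bag_of_words = vectorizer.fit_transform(splited_labels_from_corpus)

# Columns are ordered HAM, NOTHING, SPAM (sorted vocabulary).
print(bag_of_words.toarray())
```

Read this way, the matrix is a one-hot encoding of the labels rather than a bag of words over the message text.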

Thanks, everyone.

First, you should separate the labels from the texts. If you want to use CountVectorizer, you have to transform your texts one at a time:

corpus = [["this is spam, 'SPAM'"],["this is ham, 'HAM'"],["this is nothing, 'NOTHING'"]]
from sklearn.feature_extraction.text import CountVectorizer
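# ... split labels from texts
vect = CountVectorizer(analyzer='word')
# fit_transform is applied to each one-element inner list separately;
# list() forces evaluation, because map is lazy on Python 3
vect_representation = list(map(vect.fit_transform, corpus))
# ...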
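
The splitting step is left out of the answer. As a rough sketch, assuming each label is the quoted word after the last comma (split_label is a hypothetical helper, not something from scikit-learn), the texts and labels could be separated as below. This variant vectorizes all texts in a single fit_transform call so that every row shares one vocabulary, which differs from the per-document map above; the LabelEncoder step is an addition for illustration, turning the string labels into integers a classifier can use.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

corpus = [["this is spam, 'SPAM'"], ["this is ham, 'HAM'"], ["this is nothing, 'NOTHING'"]]


def split_label(document):
    """Hypothetical helper: split "text, 'LABEL'" into (text, label)."""
    text, label = document.rsplit(",", 1)
    return text.strip(), label.strip().strip("'")


pairs = [split_label(doc[0]) for doc in corpus]
texts = [text for text, _ in pairs]
labels = [label for _, label in pairs]

vect = CountVectorizer(analyzer='word')
X = vect.fit_transform(texts)              # one row per document, shared vocabulary
y = LabelEncoder().fit_transform(labels)   # classes are sorted: HAM, NOTHING, SPAM

print(X.toarray())
print(y)
```

X and y can then be passed to any scikit-learn classifier's fit method.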

As another option, you can use TfidfVectorizer directly with the list of lists.
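
As far as I can tell, TfidfVectorizer takes the same kind of input as CountVectorizer, an iterable of strings, so the sketch below still flattens the nested list before fitting. The flattening step and the name texts are my assumptions, not part of the answer, and the labels are left embedded in the strings here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [["this is spam, 'SPAM'"], ["this is ham, 'HAM'"], ["this is nothing, 'NOTHING'"]]

# Assumption: one string per inner list, so taking element 0 flattens the corpus.
texts = [doc[0] for doc in corpus]

tfidf = TfidfVectorizer(analyzer='word')
tfidf_representation = tfidf.fit_transform(texts)

# Same vocabulary as CountVectorizer would build, but with TF-IDF weights
# and (by default) each row scaled to unit L2 norm.
print(sorted(tfidf.vocabulary_))
print(tfidf_representation.toarray())
```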
