I want to convert the following phrases into vectors with sklearn:
Article 1. It is not good to eat pizza after midnight
Article 2. I wouldn't survive a day withouth stackexchange
Article 3. All of these are just random phrases
Article 4. To prove if my experiment works.
Article 5. The red dog jumps over the lazy fox
I have the following code:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)

n = 0
while n < 5:
    n = n + 1
    a = ('Article %(number)s' % {'number': n})
    print(a)
    with open("LISR2.txt") as openfile:
        for line in openfile:
            if a in line:
                X = line
                print(vectorizer.fit_transform(X))
It gives me the following error:
ValueError: Iterable over raw text documents expected, string object received.
Why does this happen? I know this should work, because if I enter the following on its own:
X=("It is not good to eat pizza","I wouldn't survive a day", "All of these")
print(vectorizer.fit_transform(X))
it gives me the vectors I want:
(0, 8) 1
(0, 2) 1
(0, 11) 1
(0, 3) 1
(0, 6) 1
(0, 4) 1
(0, 5) 1
(1, 1) 1
(1, 9) 1
(1, 12) 1
(2, 10) 1
(2, 7) 1
(2, 0) 1
Look at the docs. They say that CountVectorizer.fit_transform expects an iterable of strings (e.g. a list of strings). You are passing a single string.
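That makes sense: fit_transform in scikit-learn does two things: 1) it learns a model (fit), and 2) it applies that model to the data (transform). You want to build a matrix in which the columns are all the words in the vocabulary and the rows correspond to the documents. To do that, the vectorizer needs to know the entire vocabulary of the corpus (all the columns).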
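One way to do this is to collect all the documents first and call fit_transform once on the whole list. The snippet below is only a sketch: the `lines` list is a stand-in I made up for the contents of LISR2.txt (the real file is not shown in the question), and `documents` is just an illustrative name.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: stand-in for the lines read from LISR2.txt.
lines = [
    "Article 1. It is not good to eat pizza after midnight",
    "Article 2. I wouldn't survive a day withouth stackexchange",
    "Article 3. All of these are just random phrases",
    "Article 4. To prove if my experiment works.",
    "Article 5. The red dog jumps over the lazy fox",
]

# Gather every article line before vectorizing, so the vectorizer
# sees the whole corpus and can build one shared vocabulary.
documents = [line for line in lines if line.startswith("Article ")]

vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(documents)  # sparse matrix, one row per document

print(X.shape)                 # (number of documents, vocabulary size)
print(vectorizer.vocabulary_)  # mapping from word to column index
```

With a real file you would build `documents` from the open file handle instead of the hard-coded list; the call to fit_transform stays the same.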
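This problem occurs when you pass raw data, i.e. when you hand the string directly to the extraction function. Instead, you can set Y = [X] and pass Y as the argument, and then it will work. I ran into this problem too.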
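As a minimal sketch of that workaround (the sample sentence is taken from the question, and `matrix` is just an illustrative name):

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1)
X = "Article 1. It is not good to eat pizza after midnight"

# Passing X itself raises the ValueError from the question.
# Wrapping it in a list turns it into a corpus of one document.
Y = [X]
matrix = vectorizer.fit_transform(Y)

print(matrix.shape)  # one row, one column per distinct word found
print(matrix)
```

Note that calling fit_transform separately for each line relearns the vocabulary every time, so the column indices will not be comparable between articles. If the vectors need to share columns, fit once on the list of all lines.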