How to vectorize a new comment using the same vocabulary

Hello, I am working with sklearn and I have a list that looks like this:

list = ["comment1","comment2",...,"commentN"]

Then I created a vectorizer to build the term-count matrix:

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,stop_words=stpw)

I vectorized the list with fit_transform:

tf = tf_vectorizer.fit_transform(list)

I grouped the data into 8 clusters:

kmeans = KMeans(n_clusters=8, random_state=0).fit(tf)

Finally, I used the predict method to get the cluster assigned to each vector:

y_pred = kmeans.predict(tf)

Now I have a new comment that I would like to assign to one of my previous clusters:

comment = ["newComment"]

I tried the following, vectorizing the comment first so that I could call predict on it:

newVec = CountVectorizer(vocabulary=tf.vocabulary_)
testComment = newVec.fit_transform(comment)
y_pred_Comment = kmeans.predict(comment)
print(y_pred_Comment)

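The problem is that I get an error, because the new vectorizer called newVec does not pick up my previous vocabulary. I would appreciate help vectorizing my new comment with the same model that was produced earlier by tf_vectorizer.fit_transform(list).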

The error traceback:

<ipython-input-32-69c8879d551a> in <module>()
    129 
    130 
--> 131 newVec = CountVectorizer(vocabulary=tf.vocabulary_)
    132 
    133 comment = ["newComment"]
C:\Program Files\Anaconda3\lib\site-packages\scipy\sparse\base.py in __getattr__(self, attr)
    557             return self.getnnz()
    558         else:
--> 559             raise AttributeError(attr + " not found")
    560 
    561     def transpose(self, axes=None, copy=False):
AttributeError: vocabulary_ not found

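I think you have a slight misunderstanding of how models are used in scikit-learn. The error itself occurs because tf is the sparse matrix returned by fit_transform; the vocabulary_ attribute belongs to the fitted vectorizer (tf_vectorizer), not to that matrix. More generally, you want to fit the model on the training set and then apply that same fitted model to the test set, without fitting again. So, in your example (but using the newsgroups data):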

from sklearn import datasets, feature_extraction, cluster

# Small slices of the 20 newsgroups data stand in for the comments
newsgroups_train = datasets.fetch_20newsgroups(subset='train').data[:200]
newsgroups_test = datasets.fetch_20newsgroups(subset='test').data[:100]

# Fit the vectorizer on the training text only; this learns the vocabulary
tf_vectorizer = feature_extraction.text.CountVectorizer()
tf_train = tf_vectorizer.fit_transform(newsgroups_train)

# Fit the clustering model on the training matrix
kmeans = cluster.KMeans(n_clusters=8, random_state=0).fit(tf_train)
y_pred = kmeans.predict(tf_train)

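Now we have a fitted vectorizer and a fitted clustering model that we can apply to new data. Note that we call transform here, not fit_transform, so the vocabulary learned from the training data is reused.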

tf_test = tf_vectorizer.transform(newsgroups_test)
y_pred_test = kmeans.predict(tf_test)
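Applied to the situation in the question, the same pattern might look like the sketch below. The comments, the number of clusters, and the variable names are assumptions chosen to keep the example small; the original list and the stop word list are not shown in the question.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy comments; the real list from the question is not shown.
train_comments = [
    "the battery life is great",
    "battery died after a week",
    "great screen and great sound",
    "the screen cracked after a week",
    "sound quality is poor",
    "battery charges quickly",
]
new_comment = ["the battery is great"]

# Fit once on the training comments: this learns the vocabulary.
vectorizer = CountVectorizer()
tf_train = vectorizer.fit_transform(train_comments)

# Two clusters are used here only because the toy corpus is tiny.
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(tf_train)

# Transform (do not refit) the new comment so its columns match tf_train.
tf_new = vectorizer.transform(new_comment)
label = kmeans.predict(tf_new)
print(label)
```

Words in the new comment that were not seen during fitting are simply ignored by transform, so the new vector always has the same number of columns as the training matrix.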
