I am trying to update a scikit-learn multinomial classifier with new training data. Here is what I have tried:
from sklearn.feature_extraction.text import HashingVectorizer
import numpy as np
from sklearn.naive_bayes import MultinomialNB
# Training with first training set
targets = ['education','film','sports','laptops','phones']
x = ["football is the sport","gravity is the movie", "education is imporatant","lenovo is a laptop","android phones"]
y = np.array([2,1,0,3,4])
clf = MultinomialNB()
# non_negative=True is the older API; newer scikit-learn releases use alternate_sign=False instead
vectorizer = HashingVectorizer(stop_words='english', non_negative=True,
                               n_features=32*2)
X_train = vectorizer.transform(x)
clf.partial_fit(X_train, y, classes=[0,1,2,3,4])
# Testing with the first training set
test_data = ["android","lenovo","Transformers"]
X_test = vectorizer.transform(test_data)
print("Using Initial classifier")
pred = clf.predict(X_test)
for doc, category in zip(test_data, pred):
    print('%r => %s' % (doc, targets[category]))
# Training with updated training set
x = ["cricket", "Transformers is a film","which college"]
y = np.array([2,1,0])
X_train = vectorizer.transform(x)
clf.partial_fit(X_train, y)
# Testing with the updated training set
test_data = ["android","lenovo","Transformers"]
X_test = vectorizer.transform(test_data)
print("\nUsing Updatable classifiers")
pred = clf.predict(X_test)
for doc, category in zip(test_data, pred):
    print('%r => %s' % (doc, targets[category]))
The output is:
Using Initial classifier
'android' => phones
'lenovo' => laptops
'Transformers' => education
Using Updatable classifiers
'android' => sports
'lenovo' => education
'Transformers' => film
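I have two questions:
1) The category for 'lenovo' is wrong after the update, because no training data for that category was included when updating the classifier. Is there any way to avoid this? I do not want to supply training data for every category each time I update the classifier, so it should still work even if I provide data for only a single category during an update.
2) How can I add a new category to the existing classifier? For example, if I want to add a new category such as "health" to the existing classifier, is there a way to do that?
Any help is appreciated. Thanks.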
For the first batch, call partial_fit instead of fit, and pass the list of all classes in the problem as the classes argument:
clf.partial_fit(X, y, classes=targets)
(this assumes that y actually contains the class labels rather than their indices)
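As a rough sketch of that call with string labels, followed by a later update that covers only one class: this assumes a recent scikit-learn release, where alternate_sign=False takes the place of non_negative=True, and the larger n_features value and the corrected spelling "important" are my own choices, not part of the question.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

targets = ['education', 'film', 'sports', 'laptops', 'phones']

# alternate_sign=False keeps hashed features non-negative, as MultinomialNB requires.
# n_features=2**10 is an assumed value, chosen to make hash collisions less likely.
vectorizer = HashingVectorizer(stop_words='english', alternate_sign=False,
                               n_features=2 ** 10)
clf = MultinomialNB()

# First batch: y holds the class labels themselves, and classes lists every label.
x1 = ["football is the sport", "gravity is the movie", "education is important",
      "lenovo is a laptop", "android phones"]
y1 = np.array(['sports', 'film', 'education', 'laptops', 'phones'])
clf.partial_fit(vectorizer.transform(x1), y1, classes=targets)

# Later batch: data for a single class only; classes is not needed again.
x2 = ["cricket"]
y2 = np.array(['sports'])
clf.partial_fit(vectorizer.transform(x2), y2)

print(clf.classes_)      # the known classes, stored in sorted order
print(clf.class_count_)  # number of training documents seen per class
```

Note that predict now returns the labels directly, so the targets[category] lookup from the question is no longer needed.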
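You cannot change the number of classes after the first call to partial_fit (or fit). You either need to know the classes in advance or retrain the whole model.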
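One way to act on "know the classes in advance" is to declare a class such as 'health' in the very first call, before any examples for it exist. The sketch below assumes a recent scikit-learn release (alternate_sign=False); the 'travel' label and the sample sentences are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# 'health' is declared up front even though no health documents exist yet.
targets = ['education', 'film', 'sports', 'laptops', 'phones', 'health']

vectorizer = HashingVectorizer(stop_words='english', alternate_sign=False,
                               n_features=2 ** 10)
clf = MultinomialNB()

x1 = ["football is the sport", "gravity is the movie", "education is important",
      "lenovo is a laptop", "android phones"]
y1 = np.array(['sports', 'film', 'education', 'laptops', 'phones'])
X1 = vectorizer.transform(x1)
clf.partial_fit(X1, y1, classes=targets)

# Passing a different class list on a later call is rejected with a ValueError.
rejected = False
try:
    clf.partial_fit(X1, y1, classes=targets + ['travel'])  # 'travel' is hypothetical
except ValueError:
    rejected = True

# Data for the reserved class can arrive in any later batch.
x_health = ["doctor hospital medicine"]
clf.partial_fit(vectorizer.transform(x_health), np.array(['health']))

print(rejected)
print(dict(zip(clf.classes_, clf.class_count_)))
```

If the new category really is not known ahead of time, the remaining option is the one stated above: retrain a fresh model with the extended class list.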