Troubleshooting a random forest classifier in scikit-learn

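I am trying to run the random forest classifier from scikit-learn and am getting suspiciously bad output: fewer than 1% of the predictions are correct. The model performs far worse than chance. I am relatively new to Python, ML, and scikit-learn (a triple whammy), and my worry is that I am missing something fundamental rather than needing to fine-tune parameters. I am hoping some more experienced eyes can look over the code and see whether something is wrong with the setup.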

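I am trying to predict the class of each row in a spreadsheet based on word occurrences, so the input for each row is an array representing how many times each word appears, e.g. [1 0 0 2 0…1]. I am using scikit-learn's CountVectorizer for this processing: I feed it strings containing the words of each row, and it outputs the word-occurrence array. If this input is unsuitable for some reason, that is probably where things go wrong, but I have not found anything online or in the documentation to suggest that is the case.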
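
For reference, a minimal sketch of the kind of output CountVectorizer produces. The three rows are hypothetical stand-ins for the spreadsheet lines, and the snippet assumes Python 3 with a recent scikit-learn rather than the version used in the question.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical rows standing in for the spreadsheet lines.
rows = ["red apple red", "green apple", "green green pear"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(rows)  # sparse matrix, one row per input string

# Column order follows the index stored in vocabulary_.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
counts = X.toarray()

print(vocab)   # one entry per column
print(counts)  # one row of word counts per input string
```

Each row of `counts` is the per-word occurrence vector for one input string, which is the format described above.
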

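Right now the forest is correct about 0.5% of the time. Using exactly the same input with an SGD classifier yields close to 80%, which suggests to me that the preprocessing and vectorizing I am doing is fine and that the problem is specific to the RF classifier. My first instinct was to look for overfitting, but even when I run the model on the training data it still gets almost everything wrong.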
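
A rough sketch of that training-set comparison, assuming Python 3 and a recent scikit-learn. The toy corpus and labels are hypothetical placeholders for the real `train_file` and `train_targets`, and `training_accuracy` is a helper name introduced only for this illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical toy corpus; replace with the real train_file / train_targets.
train_file = [
    "cheap pills buy now",
    "buy cheap watches",
    "meeting at noon",
    "lunch meeting tomorrow",
] * 5
train_targets = ["spam", "spam", "ham", "ham"] * 5

# Both classifiers receive exactly the same vectorized input.
X_train = CountVectorizer().fit_transform(train_file)


def training_accuracy(clf):
    """Fit on the training set and score on that same set."""
    clf.fit(X_train, train_targets)
    return clf.score(X_train, train_targets)


rf_acc = training_accuracy(RandomForestClassifier(n_estimators=100, random_state=0))
sgd_acc = training_accuracy(SGDClassifier(random_state=0))
print("RF training accuracy: ", rf_acc)
print("SGD training accuracy:", sgd_acc)
```

If the forest scores far below the SGD classifier on data it was trained on, the cause is more likely in how inputs and targets are wired together than in the hyperparameters.
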

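I have played around with the number of trees and the amount of training data, but that does not seem to change much for me. I have tried to show only the relevant code, but can post more if that helps. First SO post, so all thoughts and feedback are appreciated.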

import csv

# Build word-occurrence vectors for each line
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1, charset_error='ignore')
X_train = vectorizer.fit_transform(train_file)
# Convert to a dense array, the required input type for the random forest classifier
X_train = X_train.todense()
# Train the random forest classifier on the data
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, compute_importances=True)
clf = clf.fit(X_train, train_targets)
# Transform the test data into the same vector format
testdata = vectorizer.transform(test_file)
testdata = testdata.todense()

# Export the predictions, one per row
with open('output.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile)
    for item in clf.predict(testdata):
        spamwriter.writerow([item])

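If you are performing poorly on the training set X_train with random forest (RF), then something is definitely wrong, because you should get a very high percentage there, above 90%. Try the following (code snippet first):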

# GMM and VBGMM are only available in older scikit-learn releases
from sklearn.cluster import KMeans
from sklearn.mixture import GMM, VBGMM
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

print("K-means")
clf = KMeans(n_clusters=len(train_targets), n_init=1000, n_jobs=2)
print("Gaussian Mixtures: full covariance")
covar_type = 'full'    # 'spherical', 'diag', 'tied', 'full'
clf = GMM(n_components=len(train_targets), covariance_type=covar_type, init_params='wc', n_iter=10000)
print("VBGMM: full covariance")
covar_type = 'full'    # 'spherical', 'diag', 'tied', 'full'
clf = VBGMM(n_components=len(train_targets), covariance_type=covar_type, alpha=1.0, random_state=None, thresh=0.01, verbose=False, min_covar=None, n_iter=1000000, params='wc', init_params='wc')
print("Random Forest")
clf = RandomForestClassifier(n_estimators=400, criterion='entropy', n_jobs=2)
print("MultiNomial Logistic Regression")
clf = LogisticRegression(penalty='l2', dual=False, C=1.0, fit_intercept=True, intercept_scaling=1, tol=0.0001)
print("SVM: Gaussian Kernel, infty iterations")
clf = SVC(C=1.0, kernel='rbf', degree=3, gamma=3.0, coef0=1.0, shrinking=True,
          probability=False, tol=0.001, cache_size=200, class_weight=None,
          verbose=False, max_iter=-1, random_state=None)

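Try different classifiers and see how they behave; the interface in scikit-learn is basically always the same (maybe RF is not really the best choice here). See the code above.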
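
Because every classifier exposes the same `fit` / `score` methods, the comparison can be written as a loop. The sketch below assumes Python 3 and a recent scikit-learn, uses synthetic data from `make_classification` as a stand-in for `X_train` and `train_targets`, and leaves most hyperparameters at their defaults instead of copying the values from the snippet above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; swap in X_train / train_targets).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

classifiers = {
    "Random Forest": RandomForestClassifier(
        n_estimators=100, criterion="entropy", random_state=0
    ),
    "Logistic Regression": LogisticRegression(C=1.0, max_iter=1000),
    "SVM (RBF kernel)": SVC(C=1.0, kernel="rbf"),
}

# Same two calls for every model: fit, then score on the training set.
scores = {}
for name, clf in classifiers.items():
    clf.fit(X, y)
    scores[name] = clf.score(X, y)
    print(name, scores[name])
```

A forest that scores well below the other models on its own training data points to a setup problem rather than a poor choice of model.
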
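Try creating some randomly generated datasets to feed to the RF classifier. I strongly suspect that something goes wrong in the mapping process that produces the vectorizer output. So start by creating X_train yourself and see how the classifier behaves.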
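
One possible version of such a check, assuming Python 3 with NumPy and a recent scikit-learn. The data is random and count-like, and the labeling rule is made up for the example. The shuffled-label step is only an illustration of one way inputs and targets could get out of step; it is not a diagnosis of the code in the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Hypothetical count-like data: 300 rows, 50 "words", counts between 0 and 3.
X = rng.randint(0, 4, size=(300, 50))
# Made-up rule so there is something learnable: class depends on column 0.
y = (X[:, 0] > 1).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
train_acc = clf.score(X, y)

# Shuffling the labels mimics rows and targets that no longer line up.
y_shuffled = rng.permutation(y)
misaligned_acc = clf.score(X, y_shuffled)

print("training accuracy:        ", train_acc)
print("accuracy vs shuffled labels:", misaligned_acc)
```

If the forest fits hand-made data like this well but fails on the real `X_train`, the next thing to inspect is how the vectorized rows and their targets are produced and paired.
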

Hope this helps.
