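I am running machine learning experiments in Python and want to add the precision metric and a confusion matrix to my experiment. My complete code is as follows: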
print('Random Forest Testing')
from sklearn import svm
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import csv
from sklearn import preprocessing
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
# Open the CSV file
f = open('Telcel_facebook_comments_train.csv')
csv_f = csv.reader(f)
# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer(analyzer='char',ngram_range=(1, 3))
# Lists that hold the comments and the tags
list_comments=[]
list_tags=[]
for row in csv_f:
    list_comments.append(row[0])
    list_tags.append(row[1])
X = vectorizer.fit_transform(list_comments)
print(X)
vectorizadorEtiquetas= preprocessing.LabelEncoder()
Y=vectorizadorEtiquetas.fit_transform(list_tags)
print(Y)
# Get the feature names
tfidf_words=vectorizer.get_feature_names()
clf = svm.SVR()
#Second Machine learning algorithm
clf2 = RandomForestClassifier(n_estimators=10)
clf2 = clf2.fit(X, Y)
#building X train and Y train matrix
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.33, random_state=47)
print('Starting training')
#clf.fit(X_train, y_train)
clf2.fit(X_train, y_train)
print('Training Completed')
print(clf2.score(X_test, y_test))
# Importing the confusion matrix and recall
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support
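# Here is where I need to add the precision and the confusion matrix. The following code is wrong
# because I don't know how to get the matrix called "y_true". I only have three classes: 1, 2, 3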
print(precision_recall_fscore_support(y_true, y_pred, average='macro'))
print(confusion_matrix(y_true, y_pred))
To make it clearer, this is part of the output:
Random Forest Testing
(0, 2128) 0.225797583675
(0, 6205) 0.243191128615
(0, 6366) 0.21798642306
(0, 3292) 0.204253719304
(0, 4763) 0.161726027808
(0, 1950) 0.264734992986
(0, 6457) 0.264734992986
(0, 5153) 0.264734992986
(0, 3216) 0.105568550619
(0, 4760) 0.128342578419
[3 1 1 ..., 2 2 2]
Starting training
Training Completed
0.881481481481
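I would appreciate help with displaying the confusion matrix and the recall metric so that I can understand my model. Thanks for your support.
This is my second attempt at getting the results. Here are the lines I have tried now: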
y_pred = clf2.predict(X_test)
print('Training Completed')
'''
Returns the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you
require for each sample that each label set be correctly predicted.
'''
print(clf2.score(X_test, y_test))
#importing Confusion Matrix and recall
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support
#Here is when I need to add the precision and confusion matrix
print(precision_recall_fscore_support(y_test, y_pred, average='macro'))
print(confusion_matrix(y_test, y_pred))
This is the output:
(0.68431620945676808, 0.61034292763991205, 0.63832235955391514, None)
[[159  83   7   0]
 [  3 811   6   0]
 [  5  22 118   0]
 [  0   1   0   0]]
C:\Program Files\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1074: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
The problem now is that I get a 4x4 confusion matrix even though I only have three labels, so I would appreciate some help here.
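Let's break this down to better understand the process:
- In your original dataset you have the input samples X and the target classes y (as far as I understand, there are three possible values here: 1, 2 and 3).
- When train_test_split is called, your input samples and target classes are split, producing X_train, X_test, y_train and y_test.
- You now have to train the model using X_train and y_train (this is the part that was misunderstood in your code):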
clf2 = clf2.fit(X_train, y_train)
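- Now that the model has been properly trained on the training data, you can actually test it on the test subsample.
Doing so produces the y_pred you are looking for: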
y_pred = clf2.predict(X_test)
y_pred is a 1D array with one element for each sample your model predicted. You know what the true values of these classes are: y_test.
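The steps above can be put together in a minimal sketch. The data here is a hypothetical random stand-in for the TF-IDF matrix and the encoded tags (it is not the Telcel dataset), and the `random_state` passed to the classifier is only there to make the run repeatable:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 300 samples, 5 features, labels 1, 2 and 3.
rng = np.random.RandomState(47)
X = rng.rand(300, 5)
Y = rng.randint(1, 4, size=300)

# Split first, so the test rows are never seen during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.33, random_state=47)

clf2 = RandomForestClassifier(n_estimators=10, random_state=47)
clf2.fit(X_train, y_train)      # fit on the training portion only

y_pred = clf2.predict(X_test)   # one predicted label per test sample
print(y_pred.shape, y_test.shape)
```

Note that the model is fitted once, on X_train and y_train only. Fitting on the full X and Y beforehand, as in the question's code, would let the model see the test rows.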
You now have y_pred and y_test (which plays the role of y_true), so you can evaluate your classifier.
I hope this helps!
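As a rough illustration of the evaluation step, here is a small sketch using hand-made label arrays (hypothetical values, not taken from your data) so the numbers are easy to check by hand. It also prints the distinct labels: `confusion_matrix` builds one row and one column for every label value that appears in either array, so a 4x4 matrix suggests that a fourth value is present somewhere. Printing `vectorizadorEtiquetas.classes_` in your script may show what that value is.

```python
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical true and predicted labels for eight test samples.
y_test = [1, 1, 1, 2, 2, 2, 3, 3]
y_pred = [1, 1, 2, 2, 2, 2, 3, 1]

# Distinct label values seen in either array; these define the matrix size.
labels_seen = sorted(set(y_test) | set(y_pred))
print(labels_seen)

# Rows are true classes, columns are predicted classes.
# Passing labels= fixes which classes are included and in what order.
cm = confusion_matrix(y_test, y_pred, labels=[1, 2, 3])
print(cm)

# Macro averaging: compute each metric per class, then take the unweighted mean.
precision, recall, fscore, support = precision_recall_fscore_support(
    y_test, y_pred, average='macro')
print(precision, recall, fscore)
```

With `average='macro'` the fourth returned value (support) is `None`, which matches the `None` at the end of the tuple in your output.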