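I have an Excel file containing the predicted values and probabilities from my model, and I need to plot multiclass ROC curves from it. The sample below shows Intent1, Intent2 and Intent3 (there are about 70 intents in total).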
Utterance  Intent_1  Conf_intent1  Intent_2  Conf_Intent2  ... and so on
Uttr1      Intent1   0.86          Intent2   0.45
Uttr2      Intent3   0.47          Intent1   0.76
Uttr3      Intent1   0.70          Intent3   0.20
Uttr4      Intent3   0.42          Intent2   0.67
Uttr5      Intent1   0.70          Intent3   0.55
Note: the probabilities are absolute scores, so they do not sum to 1 for a given utterance. The intent with the highest probability is the one that gets predicted.
Here is the code that produces the error:
import pandas as pd
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from itertools import cycle
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from numpy import interp  # scipy.interp was only a deprecated alias of numpy.interp
# Read the input file (raw string, so that "\t" is not interpreted as a tab)
df = pd.read_excel(r'C:\test.xlsx')
# Convert the columns to arrays
predicted = df['Predicted'].to_numpy()
Score = df['Probability'].to_numpy()
labels = df['Predicted'].unique()
mcm = multilabel_confusion_matrix(actual, predicted, labels=labels)
predicted = label_binarize(predicted, classes=labels)
n_class = predicted.shape[0]
print(n_class)
print(type(predicted))
# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_class):
    fpr[i], tpr[i], _ = roc_curve(predicted[:, i], Score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
# Plot of a ROC curve for a specific class
for i in range(n_class):
    plt.figure()
    plt.plot(fpr[i], tpr[i], label='ROC curve (area = %0.2f)' % roc_auc[i])
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()
But I get this error:
  File "roc.py", line 61, in <module>
    fpr[i], tpr[i], _ = roc_curve(predicted[:, i], Score[:, i])
IndexError: too many indices for array
I then removed the [:, i] indexing from predicted and Score, and got a different error:
raise ValueError("{0} format is not supported".format(y_type))
ValueError: multilabel-indicator format is not supported
Can anyone help me solve this?
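You need to make a couple of changes to your code:
- First, from a statistical point of view: ROC AUC is measured by comparing the predicted probability scores against the actual labels. You are comparing the predicted probabilities against the predicted labels, which makes no sense because the two are obviously closely related.
- Second, from a coding point of view: n_class should count the number of classes, not the number of observations. So you should use n_class = predicted.shape[1].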
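To make both points concrete, here is a minimal sketch on made-up data. The `actual` labels and the `scores` matrix are hypothetical values, not taken from your file. It shows that `label_binarize` returns an array of shape (observations, classes), so the class count is `shape[1]`, and that indexing a 1-D score array with `[:, i]` is what raises the IndexError you saw.

```python
import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc

# Hypothetical ground-truth labels and per-class scores (made-up values).
classes = ["Intent1", "Intent2", "Intent3"]
actual = np.array(["Intent1", "Intent3", "Intent1",
                   "Intent2", "Intent3", "Intent2"])
scores = np.array([
    [0.86, 0.45, 0.10],
    [0.76, 0.30, 0.47],
    [0.70, 0.15, 0.20],
    [0.20, 0.67, 0.42],
    [0.40, 0.35, 0.55],
    [0.30, 0.25, 0.60],
])

# One indicator column per class, in the same order as `classes`.
y_true = label_binarize(actual, classes=classes)
print(y_true.shape)  # (6, 3): rows are observations, columns are classes

n_obs, n_class = y_true.shape

# A single score column is 1-D, so two-index slicing fails on it.
one_d = scores[:, 0]
try:
    one_d[:, 0]
except IndexError as err:
    print("IndexError:", err)

# roc_curve takes one binary column of true labels and one score column.
roc_auc = {}
for i in range(n_class):
    fpr, tpr, _ = roc_curve(y_true[:, i], scores[:, i])
    roc_auc[classes[i]] = auc(fpr, tpr)
print(roc_auc)
```

Passing the whole binarized matrix to `roc_curve` instead of one column at a time is what triggers the "multilabel-indicator format is not supported" error.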
I put this answer together trying to stick to your code as closely as possible:
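# Ground-truth labels (this assumes the sheet has an 'Actual' column)
actual = df['Actual'].to_numpy()
Score = df[['Conf_intent1','Conf_intent2','Conf_intent3']].to_numpy()
# The class order must match the column order of Score
labels = ['Intent1', 'Intent2', 'Intent3']
actual = label_binarize(actual, classes=labels)
n_class = actual.shape[1]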
# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_class):
    fpr[i], tpr[i], _ = roc_curve(actual[:, i], Score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
# Plot of a ROC curve for a specific class
for i in range(n_class):
    plt.figure()
    plt.plot(fpr[i], tpr[i], label='ROC curve (area = %0.2f)' % roc_auc[i])
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()
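One caveat: in the sample table from the question, each confidence column seems to belong to whichever intent is named in the column before it, not to one fixed class. If that is how your file is laid out, the scores have to be rearranged into one column per intent before calling `roc_curve`. The following is only a rough sketch under that assumption. The `Actual` column and its values are made up for illustration, and treating an intent that is not listed for an utterance as having score 0.0 is my own choice, not something your data confirms.

```python
import pandas as pd
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc

# The five sample rows from the question plus a hypothetical 'Actual' column.
df = pd.DataFrame({
    "Intent_1": ["Intent1", "Intent3", "Intent1", "Intent3", "Intent1"],
    "Conf_intent1": [0.86, 0.47, 0.70, 0.42, 0.70],
    "Intent_2": ["Intent2", "Intent1", "Intent3", "Intent2", "Intent3"],
    "Conf_Intent2": [0.45, 0.76, 0.20, 0.67, 0.55],
    "Actual": ["Intent1", "Intent1", "Intent3", "Intent2", "Intent3"],  # made up
})

# (intent column, confidence column) pairs; extend this list for more pairs.
pairs = [("Intent_1", "Conf_intent1"), ("Intent_2", "Conf_Intent2")]

# Every intent that occurs as a prediction or as a true label.
classes = sorted(set(df["Actual"]).union(*(set(df[c]) for c, _ in pairs)))

# One score column per intent; unlisted intents default to 0.0 (assumption).
score = pd.DataFrame(0.0, index=df.index, columns=classes)
for intent_col, conf_col in pairs:
    for row, intent, conf in zip(df.index, df[intent_col], df[conf_col]):
        score.loc[row, intent] = conf
score_matrix = score.to_numpy()

y_true = label_binarize(df["Actual"].to_numpy(), classes=classes)

# Per-class ROC AUC, skipping classes without both positives and negatives.
per_class_auc = {}
for i, name in enumerate(classes):
    if 0 < y_true[:, i].sum() < len(y_true):
        fpr, tpr, _ = roc_curve(y_true[:, i], score_matrix[:, i])
        per_class_auc[name] = auc(fpr, tpr)

# Micro-average: pool every (observation, class) pair into one binary problem.
fpr_micro, tpr_micro, _ = roc_curve(y_true.ravel(), score_matrix.ravel())
micro_auc = auc(fpr_micro, tpr_micro)

print(pd.Series(per_class_auc))
print("micro-average AUC:", micro_auc)
```

With about 70 intents, a table of per-class AUC values plus a single micro-average is probably easier to read than 70 separate figures, but the `fpr`/`tpr` arrays can be plotted exactly as in the loop above.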