Confusion matrix for 10-fold cross-validation: how to do it with a pandas DataFrame df



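I am trying to get the confusion matrix for each of the 10 folds of a cross-validation, for any model (random forest, decision tree, naive Bayes, etc.). If I run a model the usual way, I can get a single confusion matrix without trouble, as shown below: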


from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

# implementing train-test-split (X and y come from the DataFrame shown further down)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.34, random_state=66)

# random forest model creation
rfc = RandomForestClassifier(n_estimators=200, random_state=39, max_depth=4)
rfc.fit(X_train, y_train)
# predictions
rfc_predict = rfc.predict(X_test)

print("=== Confusion Matrix ===")
print(confusion_matrix(y_test, rfc_predict))
print('\n')
print("=== Classification Report ===")
print(classification_report(y_test, rfc_predict))

Out[1]:

=== Confusion Matrix ===
[[16243  1011]
 [  827 16457]]

=== Classification Report ===
              precision    recall  f1-score   support

           0       0.95      0.94      0.95     17254
           1       0.94      0.95      0.95     17284

    accuracy                           0.95     34538
   macro avg       0.95      0.95      0.95     34538
weighted avg       0.95      0.95      0.95     34538

But now I want the confusion matrix for each of the 10 CV folds. How should I approach this? I tried the following, but it did not work.


# from sklearn import cross_validation
from sklearn.model_selection import KFold, cross_validate
kfold = KFold(n_splits=10)

conf_matrix_list_of_arrays = []
kf = cross_validate(rfc, X, y, cv=kfold)
print(kf)
for train_index, test_index in kf:

    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    rfc.fit(X_train, y_train)
    conf_matrix = confusion_matrix(y_test, rfc.predict(X_test))
    conf_matrix_list_of_arrays.append(conf_matrix)

The dataset is held in this DataFrame, dp:

Temperature  Series  Parallel  Shading  Number of cells  Voltage (V)  Current (I)  I/V  Solar Panel  Cell Shade Percentage  IsShade
30 10 1 2 10 1.11 2.19 1.97 1985 1 20.0 1
27 5 2 10 2.33 4.16 1.79 1517 3 100.0 1
30 5 2 7 10 2.01 4.34 2.16 3532 1 70.0 1
40 2 4 3 8 1.13 -20.87 -18.47 6180 1 37.5 1
45 5 2 4 10 1.13 6.52 5.77 8812 3 40.0 1

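According to the help page for cross_validate, it does not return the indices used for cross-validation. You need to get the indices from (Stratified)KFold yourself. Using an example dataset: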

from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_predict
from sklearn.ensemble import RandomForestClassifier

data = datasets.load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.34, random_state=66)

# fixed random_state so that every call to skf.split() yields the same folds
skf = StratifiedKFold(n_splits=10, random_state=111, shuffle=True)
rfc = RandomForestClassifier(n_estimators=200, random_state=39, max_depth=4)

We apply cross_val_predict to get the out-of-fold predictions for all training samples:

y_pred = cross_val_predict(rfc, X_train, y_train, cv=skf)

Then use the fold indices to split y_pred into one confusion matrix per fold:

mats = []
for train_index, test_index in skf.split(X_train, y_train):
    mats.append(confusion_matrix(y_train[test_index], y_pred[test_index]))

It looks like this:

mats[:3]
[array([[13,  2],
        [ 0, 23]]),
 array([[14,  1],
        [ 1, 22]]),
 array([[14,  1],
        [ 0, 23]])]

Check that the sum of the per-fold matrices equals the overall confusion matrix:

import numpy as np
np.add.reduce(mats)
array([[130,  14],
       [  6, 225]])
confusion_matrix(y_train, y_pred)
array([[130,  14],
       [  6, 225]])
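The two matrices agree because the test folds produced by a (Stratified)KFold splitter form a partition of the samples: every row lands in exactly one test fold. The following is a minimal sketch of that check, assuming the same breast cancer example data; the names `test_folds`, `all_test` and `is_partition` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=111)

# Collect the test indices of every fold
test_folds = [test_index for _, test_index in skf.split(X, y)]

# If the folds partition the data, the sorted test indices are exactly 0..n-1
all_test = np.sort(np.concatenate(test_folds))
is_partition = np.array_equal(all_test, np.arange(len(y)))
print(is_partition)
```

This is also why cross_val_predict can return exactly one prediction per sample for such a splitter.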

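In my view, the problem here is the incorrect unpacking of kf. By default, cross_validate() returns a dict of arrays containing the test scores and the fit/score times, not the fold indices.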
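To see what cross_validate() actually hands back, a small sketch along these lines may help. It assumes the breast cancer example data and a DecisionTreeClassifier purely as stand-ins for the asker's data and model.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Stand-in data and model (assumptions, not the asker's own)
X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0)

# cross_validate returns a dict of arrays with one entry per fold
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
results = cross_validate(clf, X, y, cv=kfold)
print(sorted(results))               # names of the returned arrays
print(results["test_score"].shape)   # one score per fold

# Iterating over a dict yields its keys (strings), so a loop such as
# "for train_index, test_index in results" cannot yield fold indices.
first_key = next(iter(results))
print(type(first_key))
```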

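Instead, you can use the split() method of the KFold instance, which generates the indices that split the data into training and test (validation) sets. So, by changing the loop to

for train_index, test_index in kfold.split(X_train, y_train):

you should get what you want.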
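Putting this together for pandas input, the loop might look like the sketch below. It makes several assumptions: the breast cancer data wrapped in a DataFrame stands in for the asker's dp, n_estimators is lowered to 50 to keep the run short, and the folds are shuffled with a fixed seed. One further point that matters for DataFrames: X[train_index] would be interpreted as a column lookup, so rows are selected by position with .iloc.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold

# Stand-in data (assumption): the asker's own DataFrame is not available here
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
rfc = RandomForestClassifier(n_estimators=50, random_state=39, max_depth=4)

conf_matrix_list_of_arrays = []
for train_index, test_index in kfold.split(X, y):
    # .iloc selects rows by integer position
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    rfc.fit(X_train, y_train)
    # labels=[0, 1] keeps every matrix 2x2 even if a fold lacks one class
    conf_matrix = confusion_matrix(y_test, rfc.predict(X_test), labels=[0, 1])
    conf_matrix_list_of_arrays.append(conf_matrix)

# Element-wise sum over the folds gives the overall confusion matrix
total_matrix = np.sum(conf_matrix_list_of_arrays, axis=0)
print(total_matrix)
```

With NumPy arrays instead of a DataFrame, plain X[train_index] indexing works as in the original attempt.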
