Using LeaveOneOut to determine k in kNN



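I want to find the best k for k-nearest-neighbor. I am using LeaveOneOut to split my data into training and test sets. In the code below I have 150 data entries, so I get 150 different training and test sets. k should be between 1 and 40.

I want to plot the cross-validated mean classification error as a function of k, to see which k works best for kNN.

Here is my code: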

import scipy.io as sio
import seaborn as sn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import LeaveOneOut

error = []
array = np.array(range(1, 41))
dataset = pd.read_excel('Data/iris.xls')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
loo = LeaveOneOut()
loo.get_n_splits(X)
for train_index, test_index in loo.split(X):
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    #print(X_train, X_test, y_train, y_test)
    for i in range(1, 41):
        classifier = KNeighborsClassifier(n_neighbors=i)
        classifier.fit(X_train, y_train)
        y_pred = classifier.predict(X_test)
        error.append(np.mean(y_pred != y_test))
plt.figure(figsize=(12, 6))
plt.plot(range(1, 41), error, color='red', linestyle='dashed', marker='o', markerfacecolor='blue', markersize=10)
plt.title('Error Rate K Value')
plt.xlabel('K Value')
plt.ylabel('Mean Error')

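You are computing the error for every single prediction, which is why your error array ends up with 6000 points (150 splits times 40 values of k). Instead, you need to collect the predictions for all the points across the folds for a given 'n_neighbors', and then compute the error for that value.

You can do it like this: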

# Loop over possible values of "n_neighbors"
for i in range(1, 41):
    # Collect the actual and predicted values for all splits for a single "n_neighbors"
    actual = []
    predicted = []

    for train_index, test_index in loo.split(X):
        #print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        classifier = KNeighborsClassifier(n_neighbors=i)
        classifier.fit(X_train, y_train)
        y_pred = classifier.predict(X_test)
        # Append the single predictions and actual values here.
        actual.append(y_test[0])
        predicted.append(y_pred[0])
    # Outside the inner loop, calculate the error for this "n_neighbors".
    error.append(np.mean(np.array(predicted) != np.array(actual)))

The rest of your code is fine.
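To make the 6000-versus-40 point concrete, here is a small sketch of how a flat list of per-prediction errors (in the order the question's nested loop produces them) maps onto one mean error per k. The sizes and the 0/1 values are made up for illustration (3 splits and 4 candidate values of k rather than 150 and 40), and the names `flat_errors` and `mean_error_per_k` are hypothetical.

```python
# Toy illustration with hypothetical sizes: 3 splits, 4 candidate k values.
n_splits, n_k = 3, 4

# Flat list in the order a split-outer / k-inner loop appends to it:
# split 0 for k=1..4, then split 1 for k=1..4, and so on.
# Each entry is 0.0 (correct) or 1.0 (wrong) for one held-out sample.
flat_errors = [
    0.0, 0.0, 1.0, 1.0,   # split 0
    0.0, 1.0, 0.0, 1.0,   # split 1
    0.0, 0.0, 0.0, 1.0,   # split 2
]
assert len(flat_errors) == n_splits * n_k

# Average over splits for each k: entry j of every split belongs to k = j + 1.
mean_error_per_k = [
    sum(flat_errors[s * n_k + j] for s in range(n_splits)) / n_splits
    for j in range(n_k)
]
print(mean_error_per_k)
```

The result has one value per k, which is the shape that plotting against range(1, 41) expects in the real 150-by-40 case.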

If you use cross_val_predict, there is a more compact way to do this:

from sklearn.model_selection import cross_val_predict

for i in range(1, 41):
    classifier = KNeighborsClassifier(n_neighbors=i)
    y_pred = cross_val_predict(classifier, X, y, cv=loo)
    error.append(np.mean(y_pred != y))
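For anyone who wants to run this end to end, here is a self-contained sketch that also picks out the k with the lowest error. It assumes scikit-learn's bundled Iris data (load_iris) as a stand-in for the 'Data/iris.xls' file from the question, which may not be identical, and the names `k_values` and `best_k` are mine.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled Iris data stands in for 'Data/iris.xls'.
X, y = load_iris(return_X_y=True)

loo = LeaveOneOut()
k_values = list(range(1, 41))
error = []

for k in k_values:
    classifier = KNeighborsClassifier(n_neighbors=k)
    # One out-of-sample prediction per data point, each from a model
    # trained on all the other points.
    y_pred = cross_val_predict(classifier, X, y, cv=loo)
    # Fraction of misclassified points for this k.
    error.append(np.mean(y_pred != y))

# np.argmin returns the first minimum, so ties resolve to the smallest k.
best_k = k_values[int(np.argmin(error))]
print(f"lowest leave-one-out error {min(error):.3f} at k={best_k}")
```

Several values of k may tie or sit very close to the minimum, so it is worth looking at the whole curve rather than only the reported best_k.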
