Where should categorical encoding be performed in a k-fold CV procedure?



I want to apply a cross-validation approach to my machine learning models. For these models, I also want feature selection and a grid search to be applied. Imagine that I want to estimate the performance of a k-nearest-neighbors classifier, applying a feature-selection technique based on the F-score (ANOVA) that selects the 10 most relevant features. The code is as follows:

import sys
import numpy as np
from sklearn.model_selection import (RepeatedKFold, RepeatedStratifiedKFold,
                                     GridSearchCV)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import make_scorer, accuracy_score

# 10-times 10-fold cross validation
n_repeats = 10
rkf = RepeatedKFold(n_splits=10, n_repeats=n_repeats, random_state=0)
# Data standardization
scaler = StandardScaler()
# Variable to contain error measures and counter for the splits
error_knn = []
split = 0

for train_index, test_index in rkf.split(X, y):

    # Print a dot for each train / test partition
    sys.stdout.write('.')
    sys.stdout.flush()

    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Standardize the data
    scaler.fit(X_train, y_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

    ###- In order to select the best number of neighbors -###
    # Pipeline for training the classifier from previous notebooks
    pipeline = Pipeline([('knn', KNeighborsClassifier())])
    N_neighbors = [1, 3, 5, 7, 11, 15, 20, 25, 30]
    param_grid = {'knn__n_neighbors': N_neighbors}
    # Evaluate the performance in a 5-fold cross-validation
    skfold = RepeatedStratifiedKFold(n_splits=5, n_repeats=1,
                                     random_state=split)
    # n_jobs = -1 to use all processors
    gridcv = GridSearchCV(pipeline, cv=skfold, n_jobs=-1,
                          param_grid=param_grid,
                          scoring=make_scorer(accuracy_score))
    result = gridcv.fit(X_train, y_train)

    ###- Results -###
    # Mean accuracy and standard deviation
    accuracies = gridcv.cv_results_['mean_test_score']
    std_accuracies = gridcv.cv_results_['std_test_score']
    # Best value for the number of neighbors
    # Define KNeighbors Classifier with that best value
    # Method fit(X,y) to fit each model according to training data
    best_Nneighbors = N_neighbors[np.argmax(accuracies)]
    knn = KNeighborsClassifier(n_neighbors=best_Nneighbors)
    knn.fit(X_train, y_train)

    # Error for the prediction
    error_knn.append(1.0 - np.mean(knn.predict(X_test) == y_test))

    split += 1

However, my columns are categorical (except for the binary label), and I need to perform categorical encoding. I cannot drop these columns because they are essential.

Where would you perform this encoding, and how would you handle the problem of categorical encoding with labels that are unseen in some folds?

Categorical encoding should be performed as the first step, precisely to avoid the problem you mention regarding labels unseen in each fold.
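As a minimal sketch of that first step (not from the original code; X is assumed here to be a pandas DataFrame, and cat_cols is a hypothetical list of your categorical column names), the encoder is fitted once on the full dataset, so every fold shares the same feature space and no fold can contain an unseen label:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical setup: X is a pandas DataFrame, cat_cols its categorical columns
enc = OneHotEncoder()
X_cat = enc.fit_transform(X[cat_cols]).toarray()
# Keep the non-categorical columns and append the encoded ones; the resulting
# array can then be indexed by the CV loop from the question as before
X = np.hstack([X.drop(columns=cat_cols).values, X_cat])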

Additionally, your current implementation suffers from data leakage: you are performing feature scaling on the full X_train dataset before running the inner cross-validation. This can be solved by including the StandardScaler in the pipeline used for GridSearchCV:

...
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]

###- In order to select the best number of neighbors -###
# Pipeline for training the classifier from previous notebooks
pipeline = Pipeline(
    [('scaler', scaler), ('knn', KNeighborsClassifier())]
)
N_neighbors = [1, 3, 5, 7, 11, 15, 20, 25, 30]
param_grid = {'knn__n_neighbors': N_neighbors}
...

A couple of additional tips:

  1. GridSearchCV has a best_estimator_ attribute that can be used to extract the estimator with the best set of hyperparameters found.
  2. When GridSearchCV is used with refit=True (the default), the object itself can be used directly to perform predictions, e.g. gridcv.predict(X_test); see the sketch after this list.
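Applied to the loop from the question, these two tips let you drop the manual re-fit of knn entirely (a sketch reusing the variable names defined above):

# The estimator refitted on the whole X_train with the best hyperparameters
best_knn = gridcv.best_estimator_
# With refit=True (the default), the grid-search object predicts with that
# refitted estimator directly, so no manual KNeighborsClassifier is needed
error_knn.append(1.0 - np.mean(gridcv.predict(X_test) == y_test))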

Edit: perhaps I was too general regarding when the categorical encoding should be performed. Your approach should depend on your problem/dataset.

If you know in advance how many categorical features exist, and you want to train your inner-CV classifiers with that knowledge, you should perform the categorical encoding first.

If you do not know at training time how many categorical features you will see, or you want to train your inner-CV classifiers without knowledge of all the categorical features, you should perform the categorical encoding at each fold.

With the former, all classifiers are trained on the same feature space; the latter cannot guarantee that.
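A hypothetical two-fold illustration of that caveat: an encoder fitted per fold derives its feature space from whatever categories the fold happens to contain, so the encoded dimensionality can differ between folds:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

fold_a = np.array([['red'], ['green']])             # this fold sees 2 categories
fold_b = np.array([['red'], ['green'], ['blue']])   # this fold sees 3 categories

print(OneHotEncoder().fit_transform(fold_a).shape)  # (2, 2) -> 2 features
print(OneHotEncoder().fit_transform(fold_b).shape)  # (3, 3) -> 3 features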

If using the latter, the above pipeline can be extended to incorporate the categorical encoding:

pipeline = Pipeline(
    [
        ('enc', OneHotEncoder()),
        ('scaler', StandardScaler(with_mean=False)),
        ('knn', KNeighborsClassifier()),
    ],
)
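Note that when encoding per fold inside the pipeline, the test part of a split may still contain categories the training part never saw, and by default OneHotEncoder raises an error on these. One way to tolerate them (an assumption on my part, not part of the original answer) is handle_unknown='ignore', which encodes unseen categories as all zeros:

pipeline = Pipeline(
    [
        # handle_unknown='ignore' maps categories unseen during fit to all zeros
        ('enc', OneHotEncoder(handle_unknown='ignore')),
        ('scaler', StandardScaler(with_mean=False)),
        ('knn', KNeighborsClassifier()),
    ],
)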

I suggest you have a careful read of the Encoding categorical features section of the scikit-learn user guide.
