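I have a dataset with 165 instances and 49 features, with a binary target (1 and 0). The dataset has missing values, so I tried KNNImputer with five-fold cross-validation. Here is the code: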
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
imputer = KNNImputer(n_neighbors=5, weights='uniform', metric='nan_euclidean')
df=read_csv('data.csv', header=None,na_values='?')
data=df.values
ix = [i for i in range(data.shape[1]) if i != 49]
X, y = data[:, ix], data[:, 49]
model = RandomForestClassifier()
pipeline = Pipeline(steps=[('i', imputer), ('m', model)])
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
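But the problem is that I don't need the scores. I want the dataset after the missing values have been filled in within the folds (either the five folds or the whole dataset), because after imputation I need to do feature selection using the five folds and then classification. So how can I get the dataset after imputation?
As discussed in the comments, the CV procedure does not serve any real purpose here. What you actually need is to:
- Fit a KNNImputer and use it to transform (impute) your training data
- Use this already-fitted imputer to transform your unseen data accordingly
This way, your training and test data share a common imputation procedure, so whatever feature selection method you choose will actually apply to both datasets.
Here is a demonstration with dummy data, adapted from the example in the documentation: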
import numpy as np
from sklearn.impute import KNNImputer
X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]] # dummy data
imputer = KNNImputer(n_neighbors=2)
X_imp = imputer.fit_transform(X) # fit imputer & transform training data in 1 step
X_imp
# result:
array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])
# new (unseen - test) data with missing values:
# we DON'T fit the imputer again
X_new = np.array([[7, 3, 4], [np.nan, 8, 7]])
X_new_imp = imputer.transform(X_new) # use the imputer already fitted with the training data
X_new_imp
# result:
array([[7. , 3. , 4. ],
       [5.5, 8. , 7. ]])
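If you do want the imputed data for each of the five folds, the same fit/transform pattern can be applied inside a manual loop over the splits. What follows is only a minimal sketch, not a definitive recipe: the synthetic `X` and `y` stand in for your real 165 x 49 data, and `imputed_folds` is a hypothetical name chosen for illustration. The imputer is refitted on the training part of each fold only, so nothing from the held-out part leaks into the imputation.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the real data (assumption): 165 rows, 49 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(165, 49))
y = np.arange(165) % 2                 # alternating binary target
X[rng.random(X.shape) < 0.1] = np.nan  # mark roughly 10% of the values as missing

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

imputed_folds = []  # hypothetical container: one tuple per fold
for train_idx, test_idx in cv.split(X, y):
    # A fresh imputer per fold, fitted on the training rows only
    imputer = KNNImputer(n_neighbors=5, weights='uniform', metric='nan_euclidean')
    X_train_imp = imputer.fit_transform(X[train_idx])
    # The held-out rows are transformed with the already-fitted imputer
    X_test_imp = imputer.transform(X[test_idx])
    imputed_folds.append((X_train_imp, y[train_idx], X_test_imp, y[test_idx]))
```

Each entry of `imputed_folds` then holds the imputed training and held-out arrays for one fold, which can be passed on to whatever feature selection and classifier you choose.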