I am looking at the sklearn documentation page "Imputing missing values before building an estimator". The relevant code is:
import numpy as np
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.cross_validation import cross_val_score
rng = np.random.RandomState(0)
dataset = load_boston()
X_full, y_full = dataset.data, dataset.target
n_samples = X_full.shape[0]
n_features = X_full.shape[1]
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
score = cross_val_score(estimator, X_full, y_full).mean()
print("Score with the entire dataset = %.2f" % score)
# Add missing values in 75% of the lines
missing_rate = 0.75
n_missing_samples = int(np.floor(n_samples * missing_rate))
missing_samples = np.hstack((np.zeros(n_samples - n_missing_samples,
                                      dtype=bool),
                             np.ones(n_missing_samples,
                                     dtype=bool)))
rng.shuffle(missing_samples)
missing_features = rng.randint(0, n_features, n_missing_samples)
# Estimate the score without the lines containing missing values
X_filtered = X_full[~missing_samples, :]
y_filtered = y_full[~missing_samples]
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
score = cross_val_score(estimator, X_filtered, y_filtered).mean()
print("Score without the samples containing missing values = %.2f" % score)
# Estimate the score after imputation of the missing values
X_missing = X_full.copy()
X_missing[np.where(missing_samples)[0], missing_features] = 0
y_missing = y_full.copy()
estimator = Pipeline([("imputer", Imputer(missing_values=0,
                                          strategy="mean",
                                          axis=0)),
                      ("forest", RandomForestRegressor(random_state=0,
                                                       n_estimators=100))])
score = cross_val_score(estimator, X_missing, y_missing).mean()
print("Score after imputation of the missing values = %.2f" % score)
Now I want to make predictions, which should be simple, but
estimator.predict(X=X_filtered[1:10,:])
returns the following error:
"AttributeError: 'Imputer' object has no attribute 'statistics_'"
What is going on here?
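After more research, I figured it out. The problem is that cross_val_score fits several models internally, one per fold, and none of them is kept for prediction; the estimator object passed in is left unfitted. Prediction requires a single fitted model. This can be solved in several ways, and a simple one is: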
estimator.fit(X_missing, y_missing)
estimator.predict(X=X_filtered[1:10,:])
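To make the point about cross_val_score concrete, here is a minimal sketch rather than a definitive version of the example. It assumes a recent scikit-learn, where SimpleImputer (sklearn.impute) and sklearn.model_selection take the place of Imputer and sklearn.cross_validation, and it uses synthetic data from make_regression with np.nan as the missing-value marker instead of the Boston data. The names pipe, fitted_after_cv and fitted_after_fit are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption: not the dataset from the question).
X, y = make_regression(n_samples=120, n_features=5, noise=0.1, random_state=0)

# Blank out one randomly chosen feature in 60 of the 120 rows.
rng = np.random.RandomState(0)
X_missing = X.copy()
rows = rng.choice(X.shape[0], size=60, replace=False)
cols = rng.randint(0, X.shape[1], size=60)
X_missing[rows, cols] = np.nan

pipe = Pipeline([
    ("imputer", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("forest", RandomForestRegressor(n_estimators=10, random_state=0)),
])

# cross_val_score fits clones of `pipe`, one per fold; `pipe` itself stays unfitted.
scores = cross_val_score(pipe, X_missing, y, cv=3)
fitted_after_cv = hasattr(pipe.named_steps["imputer"], "statistics_")

# An explicit fit produces the single fitted model that predict() needs.
pipe.fit(X_missing, y)
fitted_after_fit = hasattr(pipe.named_steps["imputer"], "statistics_")
predictions = pipe.predict(X_missing[1:10, :])

print("imputer fitted after cross_val_score:", fitted_after_cv)
print("imputer fitted after fit:", fitted_after_fit)
print("number of predictions:", predictions.shape[0])
```

The statistics_ attribute checked here is the same one named in the error message: it only exists on the imputer once the pipeline that owns it has been fitted.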
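Since the original example uses cross-validation, another possible route is to use GridSearchCV and then take its best_estimator_, but that is beyond the scope of this question.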
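For completeness, a rough sketch of that GridSearchCV route, under the same assumptions as before: a recent scikit-learn, SimpleImputer instead of Imputer, synthetic data, and np.nan as the missing marker. The parameter grid over the imputation strategy is only an illustrative choice, not something taken from the original example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption: not the dataset from the question).
X, y = make_regression(n_samples=120, n_features=5, noise=0.1, random_state=0)
X_missing = X.copy()
X_missing[::2, 0] = np.nan  # every second row loses its first feature

pipe = Pipeline([
    ("imputer", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("forest", RandomForestRegressor(n_estimators=10, random_state=0)),
])

# Illustrative grid: compare two imputation strategies with 3-fold CV.
param_grid = {"imputer__strategy": ["mean", "median"]}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_missing, y)

# With the default refit=True, best_estimator_ is refitted on all the data
# passed to fit(), so it can be used for prediction directly.
best_model = search.best_estimator_
predictions = best_model.predict(X_missing[1:10, :])

print("best params:", search.best_params_)
print("number of predictions:", predictions.shape[0])
```

Unlike cross_val_score, the search object keeps a fitted model around, which is why no separate fit call is needed before predict.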