Grid search returns exactly the same results given a custom model

I wrapped a Scikit-Learn Random Forest model in a class as follows:

from sklearn.base import BaseEstimator, RegressorMixin

class Model(BaseEstimator, RegressorMixin):
    def __init__(self, model):
        self.model = model

    def fit(self, X, y):
        self.model.fit(X, y)
        return self

    def score(self, X, y):
        from sklearn.metrics import mean_squared_error
        return mean_squared_error(y_true=y,
                                  y_pred=self.model.predict(X),
                                  squared=False)

    def predict(self, X):
        return self.model.predict(X)
class RandomForest(Model):
    def __init__(self, n_estimators=100,
                 max_depth=None, min_samples_split=2,
                 min_samples_leaf=1, max_features=None):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.max_features = max_features

        from sklearn.ensemble import RandomForestRegressor
        self.model = RandomForestRegressor(n_estimators=self.n_estimators,
                                           max_depth=self.max_depth,
                                           min_samples_split=self.min_samples_split,
                                           min_samples_leaf=self.min_samples_leaf,
                                           max_features=self.max_features,
                                           random_state=777)

    def get_params(self, deep=True):
        return {"n_estimators": self.n_estimators,
                "max_depth": self.max_depth,
                "min_samples_split": self.min_samples_split,
                "min_samples_leaf": self.min_samples_leaf,
                "max_features": self.max_features}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

I mostly followed the official Scikit-Learn developer guide, available at https://scikit-learn.org/stable/developers/develop.html

Here is my grid search:

from sklearn.model_selection import GridSearchCV
import pandas as pd

grid_search = GridSearchCV(estimator=RandomForest(),
                           param_grid={'max_depth': [1, 3, 6],
                                       'n_estimators': [10, 100, 300]},
                           n_jobs=-1,
                           scoring='neg_root_mean_squared_error',
                           cv=5, verbose=True).fit(X, y)

print(pd.DataFrame(grid_search.cv_results_).sort_values(by='rank_test_score'))

The grid search output and grid_search.cv_results_ are printed below:

Fitting 5 folds for each of 9 candidates, totalling 45 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
mean_fit_time  std_fit_time  mean_score_time  std_score_time  
0       0.210918      0.002450         0.016754        0.000223   
1       0.207049      0.001675         0.016579        0.000147   
2       0.206495      0.002001         0.016598        0.000158   
3       0.206799      0.002417         0.016740        0.000144   
4       0.207534      0.001603         0.016668        0.000269   
5       0.206384      0.001396         0.016605        0.000136   
6       0.220052      0.024280         0.017247        0.001137   
7       0.226838      0.027507         0.017351        0.000979   
8       0.205738      0.003420         0.016246        0.000626   
param_max_depth param_n_estimators                                 params  
0               1                 10   {'max_depth': 1, 'n_estimators': 10}   
1               1                100  {'max_depth': 1, 'n_estimators': 100}   
2               1                300  {'max_depth': 1, 'n_estimators': 300}   
3               3                 10   {'max_depth': 3, 'n_estimators': 10}   
4               3                100  {'max_depth': 3, 'n_estimators': 100}   
5               3                300  {'max_depth': 3, 'n_estimators': 300}   
6               6                 10   {'max_depth': 6, 'n_estimators': 10}   
7               6                100  {'max_depth': 6, 'n_estimators': 100}   
8               6                300  {'max_depth': 6, 'n_estimators': 300}   
split0_test_score  split1_test_score  split2_test_score  split3_test_score  
0          -5.246725          -3.200585          -3.326962          -3.209387   
1          -5.246725          -3.200585          -3.326962          -3.209387   
2          -5.246725          -3.200585          -3.326962          -3.209387   
3          -5.246725          -3.200585          -3.326962          -3.209387   
4          -5.246725          -3.200585          -3.326962          -3.209387   
5          -5.246725          -3.200585          -3.326962          -3.209387   
6          -5.246725          -3.200585          -3.326962          -3.209387   
7          -5.246725          -3.200585          -3.326962          -3.209387   
8          -5.246725          -3.200585          -3.326962          -3.209387   
split4_test_score  mean_test_score  std_test_score  rank_test_score  
0          -2.911422        -3.579016        0.845021                1  
1          -2.911422        -3.579016        0.845021                1  
2          -2.911422        -3.579016        0.845021                1  
3          -2.911422        -3.579016        0.845021                1  
4          -2.911422        -3.579016        0.845021                1  
5          -2.911422        -3.579016        0.845021                1  
6          -2.911422        -3.579016        0.845021                1  
7          -2.911422        -3.579016        0.845021                1  
8          -2.911422        -3.579016        0.845021                1  
[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed:    3.2s finished

My question is: why does the grid search return exactly the same results for every parameter combination?

My hypothesis is that the grid search appears to evaluate only one point of the parameter grid (e.g., {'max_depth': 1, 'n_estimators': 10}) for all splits. If that is the case, why does it happen?

Finally, how can I make the grid search return the correct results for all parameter combinations?

The set_params method does not actually change the hyperparameters of the RandomForestRegressor instance stored in the model attribute. Instead, it sets those attributes directly on the RandomForest instance, where they have no effect on the actual model. So the grid search repeatedly sets these irrelevant parameters, and the actual model being fitted is identical on every run. (Similarly, get_params returns the RandomForest attributes, which are not the same as the RandomForestRegressor's.)
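The mismatch is easy to verify directly. A minimal sketch (stripped down to a single parameter, mirroring the wrapper in the question) shows that set_params changes the wrapper attribute but leaves the inner estimator untouched:

```python
from sklearn.ensemble import RandomForestRegressor

class RandomForest:
    """Stripped-down copy of the question's wrapper, keeping only max_depth."""
    def __init__(self, max_depth=None):
        self.max_depth = max_depth
        self.model = RandomForestRegressor(max_depth=self.max_depth)

    def set_params(self, **parameters):
        # Same logic as in the question: set attributes on the wrapper itself.
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

rf = RandomForest()
rf.set_params(max_depth=3)
print(rf.max_depth)        # 3    -- the wrapper attribute changed
print(rf.model.max_depth)  # None -- the inner estimator did not
```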

By making set_params simply call self.model.set_params (and making get_params read self.model.<parameter_name> instead of self.<parameter_name>), you should be most of the way to a fix.
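A minimal sketch of that suggestion, shortened to two parameters (the remaining ones follow the same pattern):

```python
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.ensemble import RandomForestRegressor

class RandomForest(BaseEstimator, RegressorMixin):
    def __init__(self, n_estimators=100, max_depth=None):
        self.model = RandomForestRegressor(n_estimators=n_estimators,
                                           max_depth=max_depth,
                                           random_state=777)

    def get_params(self, deep=True):
        # Read from the wrapped estimator, not from the wrapper.
        return {"n_estimators": self.model.n_estimators,
                "max_depth": self.model.max_depth}

    def set_params(self, **parameters):
        # Forward everything to the wrapped estimator.
        self.model.set_params(**parameters)
        return self

rf = RandomForest().set_params(max_depth=6, n_estimators=10)
print(rf.get_params())  # {'n_estimators': 10, 'max_depth': 6}
```

Now each candidate in the grid actually reconfigures the underlying RandomForestRegressor before fitting.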

I think there is another issue as well, and I don't quite see how your example gets around it: the model attribute is instantiated from the self.<parameter_name> values inside __init__, so the inner estimator is built once with the initial values and is never rebuilt when those parameters change later.
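The convention from the scikit-learn developer guide linked above sidesteps both problems: __init__ only stores its arguments, the inner estimator is built inside fit from the current parameter values, and get_params/set_params are simply inherited from BaseEstimator. A sketch under those conventions (random_state=777 kept from the original; the trailing underscore on model_ marks a fit-time attribute, per sklearn convention):

```python
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.ensemble import RandomForestRegressor

class RandomForest(BaseEstimator, RegressorMixin):
    def __init__(self, n_estimators=100, max_depth=None,
                 min_samples_split=2, min_samples_leaf=1, max_features=None):
        # Only store the arguments; do NOT build the estimator here.
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.max_features = max_features

    def fit(self, X, y):
        # Build the estimator at fit time, from the current parameter values,
        # so set_params (inherited from BaseEstimator) always takes effect.
        self.model_ = RandomForestRegressor(
            n_estimators=self.n_estimators,
            max_depth=self.max_depth,
            min_samples_split=self.min_samples_split,
            min_samples_leaf=self.min_samples_leaf,
            max_features=self.max_features,
            random_state=777)
        self.model_.fit(X, y)
        return self

    def predict(self, X):
        return self.model_.predict(X)
```

With this layout, GridSearchCV can clone, reparameterize, and refit the wrapper without any hand-written get_params/set_params, and each candidate in the grid produces a genuinely different forest.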
