I want a procedure that, as a result, gives me a list of machine learning models together with their accuracy scores, but only for the parameter set that gives the best result for that type of model.
For example, here is cross-validation with XGBoost only.
The dataset:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
data = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                    columns=iris['feature_names'] + ['target'])
from sklearn.model_selection import train_test_split
X = data.drop(['target'], axis=1)
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
A function that finds the best parameters:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, make_scorer
accu = make_scorer(accuracy_score) # I will be using f1 in future
def predict_for_best_params(alg, X_train, y_train, X_test):
    params = {'n_estimators': [200, 300, 500]}
    clf = GridSearchCV(alg, params, scoring=accu, cv=2)
    clf.fit(X_train, y_train)
    print(clf.best_estimator_)
    y_pred = clf.predict(X_test)
    return y_pred
Using it on a single model:
from xgboost import XGBClassifier
alg = [XGBClassifier()]
y_pred = predict_for_best_params(alg[0], X_train, y_train, X_test)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
What I want to achieve is:
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier

alg = [XGBClassifier(), RandomForestClassifier()]  # list of many of them
alg_params = {'XGBClassifier': [{'n_estimators': [200, 300, 500]}],
              'RandomForestClassifier': [{'max_depth': [1, 2, 3, 4]}]}

def predict_for_best_params(alg, X_train, y_train, X_test, params):
    clf = GridSearchCV(alg, params, scoring=accu, cv=2)
    clf.fit(X_train, y_train)
    print(clf.best_estimator_)
    y_pred = clf.predict(X_test)
    return y_pred

for algo in alg:
    params = alg_params[str(algo)][0]  # this won't work, because str(algo) is not e.g. 'XGBClassifier' but 'XGBClassifier(<all default params>)'
    y_pred = predict_for_best_params(algo, X_train, y_train, X_test, params)
    print('{} accuracy is: {}'.format(algo, accuracy_score(y_test, y_pred)))
Is this a good way to achieve it?
If your only concern is how to look up the key, then you can use
params = alg_params[alg.__class__.__name__][0]
This will return just the class name of the alg object.
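For example, a small sketch based on the code in your question (assuming xgboost and scikit-learn are installed):

from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier

alg = [XGBClassifier(), RandomForestClassifier()]
alg_params = {'XGBClassifier': [{'n_estimators': [200, 300, 500]}],
              'RandomForestClassifier': [{'max_depth': [1, 2, 3, 4]}]}

for algo in alg:
    # __class__.__name__ gives just the class name, e.g. 'XGBClassifier',
    # unlike str(algo), which also includes all the default parameters.
    params = alg_params[algo.__class__.__name__][0]
    print(algo.__class__.__name__, params)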
For a different approach, you can have a look at my other answer:
- https://stackoverflow.com/a/51629917/3374996
That answer makes use of the fact that GridSearchCV can take a list of dicts of parameter combinations, where each dict in the list is expanded separately. A few notes (see the sketch after this list):
- It may be faster than your current for-loop if you use n_jobs > 1 (to enable multiprocessing).
- Afterwards, you can analyze the scores from the cv_results_ attribute of the finished GridSearchCV.
- To compute y_pred for a single estimator, you can filter cv_results_ (perhaps by loading it into a pandas DataFrame), fit that estimator again with the best parameters found, and then compute y_pred. But that should be easy.