Parallelizing a randomized grid search on a Sklearn estimator with Joblib



I'm trying to run a randomized grid search on a Sklearn estimator, but I don't want to cross-validate because I already have train/validation/test splits for my data. I've built a function that runs the randomized grid search, but I'd like to parallelize it across threads. I've been looking into Joblib and trying to work out how to adapt the Parallel(delayed(func)) pattern, but I can't figure out how to apply it to my code.

Here is my function:

from random import sample

import pandas as pd
from sklearn.metrics import recall_score
from sklearn.model_selection import ParameterGrid


def randomized_grid_search(model=None, param_grid=None, percent=0.5,
                           X_train=None, y_train=None,
                           X_val=None, y_val=None):
    # converts parameter grid into a list
    param_list = list(ParameterGrid(param_grid))
    # the number of combinations to try in the grid
    n = int(len(param_list) * percent)
    # the reduced grid as a list
    reduced_grid = sample(param_list, n)
    best_score = 0
    best_grid = None
    """ 
    Loops through each of the posibble scenarios and
    then scores each model with prediction from validation set.
    The best score is kept and held with best parameters.
    """ 
    for g in reduced_grid:
        model.set_params(**g)
        model.fit(X_train,y_train)
        y_pred = model.predict(X_val)
        recall = recall_score(y_val, y_pred)
        if recall > best_score:
            best_score = recall
            best_grid = g
    """
    Combines the training and validation datasets and 
    trains the model with the best parameters from the 
    grid search"""
    best_model = model
    best_model.set_params(**best_grid)
    X2 = pd.concat([X_train, X_val])
    y2 = pd.concat([y_train, y_val])
    return best_model.fit(X2, y2)

From https://joblib.readthedocs.io/en/latest/parallel.html, I think this is the direction I need to head in:

from math import sqrt

from joblib import Parallel, delayed

with Parallel(n_jobs=2) as parallel:
    accumulator = 0.
    n_iter = 0
    while accumulator < 1000:
        results = parallel(delayed(sqrt)(accumulator + i ** 2)
                           for i in range(5))
        accumulator += sum(results)  # synchronization barrier
        n_iter += 1

Should I be doing something like this, or am I going about it the wrong way entirely?

Have you tried the built-in parallelization via the n_jobs parameter?

grid = sklearn.model_selection.GridSearchCV(..., n_jobs=-1)

The GridSearchCV documentation describes the n_jobs parameter as:

n_jobs : int or None, optional (default=None). Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors ...

So while this won't distribute the work across threads, it will distribute it across processors, achieving a degree of parallelization.
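Since you want to keep your fixed train/validation split rather than cross-validate, one way to still get GridSearchCV's n_jobs parallelism is sklearn's PredefinedSplit, which encodes a single fixed fold. This is a minimal sketch with synthetic data standing in for your splits:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Synthetic stand-ins for the existing train/validation arrays.
X, y = make_classification(n_samples=200, random_state=0)
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

# test_fold: -1 marks rows that always stay in training,
# 0 marks rows that form the single validation fold.
test_fold = np.concatenate([np.full(len(X_train), -1),
                            np.zeros(len(X_val))])
X_all = np.vstack([X_train, X_val])
y_all = np.concatenate([y_train, y_val])

grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {"n_estimators": [10, 25]},
                    scoring="recall",
                    cv=PredefinedSplit(test_fold),
                    n_jobs=-1)
grid.fit(X_all, y_all)
```

With refit (the default), grid.best_estimator_ is retrained on all of X_all, which matches the final retrain-on-train-plus-validation step in your function.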

I found some code on GitHub written by @skylander86, where the author uses:

param_scores = Parallel(n_jobs=self.n_jobs)(
    delayed(_fit_classifier)(klass, self.classifier_args, param, self.metric,
                             X_train, Y_train, X_validation, Y_validation)
    for param in ParameterGrid(self.param_grid))
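Adapting that pattern to the randomized search above could look like the following sketch; `_fit_and_score` is a hypothetical helper (not from the linked code), and the dataset here is synthetic:

```python
from random import sample

from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import ParameterGrid


def _fit_and_score(model, params, X_train, y_train, X_val, y_val):
    # Each worker fits an independent clone of the estimator.
    m = clone(model).set_params(**params)
    m.fit(X_train, y_train)
    return recall_score(y_val, m.predict(X_val)), params


# Synthetic stand-ins for the existing train/validation splits.
X, y = make_classification(n_samples=200, random_state=0)
X_train, y_train, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

# Same sampling step as in randomized_grid_search.
param_list = list(ParameterGrid({"n_estimators": [10, 25],
                                 "max_depth": [2, 4]}))
reduced_grid = sample(param_list, int(len(param_list) * 0.5))

# Fit and score every sampled combination in parallel.
scores = Parallel(n_jobs=2)(
    delayed(_fit_and_score)(RandomForestClassifier(random_state=0), g,
                            X_train, y_train, X_val, y_val)
    for g in reduced_grid)
best_score, best_grid = max(scores, key=lambda t: t[0])
```

The final retrain on the combined train and validation data would then proceed exactly as in the original function, using best_grid.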

I hope that helps.
