Gridsearch technique in sklearn, python

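I'm working on a supervised machine learning algorithm and it seems to behave strangely. So, let me start:

I have a function to which I pass different classifiers, their parameters, the training data, and the labels: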

from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

def HT(targets, train_new, algorithm, parameters):
    # create the scorer
    scorer = make_scorer(f1_score)
    # create the grid search object with the parameters of the function
    grid_search = GridSearchCV(algorithm, param_grid=parameters,
                               scoring=scorer, cv=5)
    # fit the grid_search object to the data
    grid_search.fit(train_new, targets.ravel())
    # print the name of the classifier, the best score and best parameters
    print(algorithm.__class__.__name__)
    print('Best score: {}'.format(grid_search.best_score_))
    print('Best parameters: {}'.format(grid_search.best_params_))
    # assign the best estimator to the pipeline variable
    pipeline = grid_search.best_estimator_
    # predict the results for the training set
    results = pipeline.predict(train_new).astype(int)
    print(results)
    return pipeline

To this function I pass parameters such as:

clf_param.append({'C': np.array([0.001, 0.01, 0.1, 1, 10]),
                  'kernel': ['linear', 'rbf'],
                  'decision_function_shape': ['ovr']})

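OK, this is where things start to get strange. The function returns an f1_score, but it differs from the score I compute manually with the formula: F1 = 2 * (precision * recall) / (precision + recall)

The difference is quite large (0.68 versus 0.89).

What am I doing wrong in the function? Shouldn't the score computed by grid_search (grid_search.best_score_) be the same as the score on the whole training set (grid_search.best_estimator_.predict(train_new))? Thanks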

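The score you are calculating manually takes into account the global true positives and negatives across all classes. But in scikit-learn's f1_score, the default is to compute the binary average (i.e. only for the positive class).

So, to get the same scores, use f1_score as specified below: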

scorer=make_scorer(f1_score, average='micro')

Or simply, in GridSearchCV, use:

scoring = 'f1_micro'
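
As a rough illustration of the string form, here is a minimal sketch of a grid search scored with 'f1_micro'. The synthetic data from make_classification, the SVC estimator, and the small parameter grid are assumptions made for this example only; they stand in for train_new, targets, and the classifiers passed to HT in the question.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for train_new / targets (not the asker's data)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Small illustrative grid (assumed values, not the grid from the question)
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}

# scoring='f1_micro' is the predefined string for micro-averaged F1
grid_search = GridSearchCV(SVC(), param_grid=param_grid,
                           scoring='f1_micro', cv=5)
grid_search.fit(X, y)

print('Best parameters: {}'.format(grid_search.best_params_))
print('Best score: {}'.format(grid_search.best_score_))
```

Keep in mind that best_score_ is the mean score over the cross-validation folds for the best parameter combination, so it is not expected to equal a score computed from predictions on the full training set.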

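For more information on how the averaging of scores is done, see: http://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values

You may also want to look at the following answer, which describes the calculation of scores in scikit-learn in detail: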
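
To see how much the averaging choice can matter, the sketch below scores the same predictions with the default average='binary' and with average='micro'. The label vectors are hypothetical, picked only so that the two averages differ; they are not the asker's data.

```python
from sklearn.metrics import f1_score

# Hypothetical labels, chosen only to make the two averages differ
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Default average='binary': F1 of the positive class (label 1) only
f1_binary = f1_score(y_true, y_pred)

# average='micro': F1 from counts pooled over both classes
f1_micro = f1_score(y_true, y_pred, average='micro')

print('binary F1: {:.3f}'.format(f1_binary))  # 0.571
print('micro F1:  {:.3f}'.format(f1_micro))   # 0.700
```

Here the positive class has 2 true positives, 1 false positive and 2 false negatives, giving a binary F1 of 4/7, while the pooled counts give 0.7.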

https://stackoverflow.com/a/31575870/3374996

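Edit: Changed macro to micro. As stated in the documentation:

'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.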

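The definition quoted above can be checked by hand. The following is a minimal sketch, assuming hypothetical three-class labels, that pools true positives, false positives and false negatives from the confusion matrix, applies F1 = 2 * (precision * recall) / (precision + recall), and compares the result with f1_score(..., average='micro').

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical three-class labels, for illustration only
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

# Pool the counts over all classes
tp = np.diag(cm).sum()                      # correctly predicted samples
fp = (cm.sum(axis=0) - np.diag(cm)).sum()   # predicted as a class, but wrong
fn = (cm.sum(axis=1) - np.diag(cm)).sum()   # belonging to a class, but missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
manual_micro = 2 * (precision * recall) / (precision + recall)

sklearn_micro = f1_score(y_true, y_pred, average='micro')
print(manual_micro, sklearn_micro)
```

With single-label predictions every misclassified sample counts once as a false positive and once as a false negative, so the pooled precision, recall and F1 all coincide.
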