Logistic regression and GridSearchCV using Python sklearn



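I am trying out the code from this page. I ran it up to the part LR (tf-idf) and got similar results.

After that I decided to try GridSearchCV. My questions are below:

1)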

# let's try GridSearchCV
#https://www.kaggle.com/enespolat/grid-search-with-logistic-regression
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
grid={"C":np.logspace(-3,3,7), "penalty":["l2"]}# l1 lasso l2 ridge
logreg=LogisticRegression(solver = 'liblinear')
logreg_cv=GridSearchCV(logreg,grid,cv=3,scoring='f1')
logreg_cv.fit(X_train_vectors_tfidf, y_train)
print("tuned hpyerparameters :(best parameters) ",logreg_cv.best_params_)
print("best score :",logreg_cv.best_score_)
#tuned hpyerparameters :(best parameters)  {'C': 10.0, 'penalty': 'l2'}
#best score : 0.7390325593588823

Then I calculated the F1 score manually. Why doesn't it match?

logreg_cv.predict_proba(X_train_vectors_tfidf)[:,1]
final_prediction=np.where(logreg_cv.predict_proba(X_train_vectors_tfidf)[:,1]>=0.5,1,0)
#https://www.statology.org/f1-score-in-python/
from sklearn.metrics import f1_score
#calculate F1 score
f1_score(y_train, final_prediction)
0.9839388145315489
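2) If I try scoring='precision', why do I get the warning below? I am unclear mainly because I have a relatively balanced dataset (55-45%), and f1, which requires precision, is calculated without any problem.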

#lets try gridsearchcv #https://www.kaggle.com/enespolat/grid-search-with-logistic-regression

from sklearn.model_selection import GridSearchCV
grid={"C":np.logspace(-3,3,7), "penalty":["l2"]}# l1 lasso l2 ridge
logreg=LogisticRegression(solver = 'liblinear')
logreg_cv=GridSearchCV(logreg,grid,cv=3,scoring='precision')
logreg_cv.fit(X_train_vectors_tfidf, y_train)
print("tuned hpyerparameters :(best parameters) ",logreg_cv.best_params_)
print("best score :",logreg_cv.best_score_)

/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
tuned hpyerparameters :(best parameters)  {'C': 0.1, 'penalty': 'l2'}
best score : 0.9474200393672962
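3) Is there an easier way to get back the predictions for the train data? We already have the fitted GridSearchCV object. I used the line below to get the predictions. Is there a better way?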

logreg_cv.predict_proba(X_train_vectors_tfidf)[:,1]

############################

############ Update 1

1) Regarding question 1 above, a comment on the question says: "The best score in GridSearchCV is calculated by taking the average score from cross validation for the best estimators. That is, it is calculated from data that is held out during fitting. From what I can tell, you are calculating predicted values from the training data and calculating an F1 score on that. Since the model was trained on that data, that is why the F1 score is so much larger compared to the results in the grid search."

Is that the reason I get the results below?

#tuned hpyerparameters :(best parameters) {'C': 10.0, 'penalty': 'l2'}
#best score : 0.7390325593588823

But when I do it manually, I get

f1_score(y_train, final_prediction)
0.9839388145315489
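The difference described in that comment can be reproduced on a small example. The sketch below is only illustrative: it uses synthetic data from make_classification as a stand-in for the tf-idf matrix (an assumption, not the data from the question), and it leaves penalty at its default of l2. It compares best_score_ with the mean of the held-out fold scores and with the F1 score computed on the training data itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the tf-idf features (an assumption, not the original data).
X, y = make_classification(n_samples=600, n_features=200, n_informative=10,
                           flip_y=0.2, random_state=0)

grid = {"C": np.logspace(-3, 3, 7)}  # penalty is left at its default (l2)
search = GridSearchCV(LogisticRegression(solver="liblinear"), grid, cv=3, scoring="f1")
search.fit(X, y)

# best_score_ is the mean of the held-out fold scores of the best candidate.
i = search.best_index_
fold_scores = [search.cv_results_[f"split{k}_test_score"][i] for k in range(3)]
cv_f1 = float(np.mean(fold_scores))

# Scoring on the same data the refitted model was trained on is usually optimistic.
train_f1 = f1_score(y, search.predict(X))

print("mean held-out F1:", cv_f1)
print("best_score_     :", search.best_score_)
print("training-set F1 :", train_f1)
```

The first two numbers should agree, while the third is measured on data the model has already seen, so it is not comparable to the cross-validated score.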

2)

I tried tuning with f1_micro as suggested in the answer below. There is no error message. I am still unclear why f1_micro does not fail when precision does.

from sklearn.model_selection import GridSearchCV
grid={"C":np.logspace(-3,3,7), "penalty":["l2"], "solver":['liblinear','newton-cg'], 'class_weight':[{ 0:0.95, 1:0.05 }, { 0:0.55, 1:0.45 }, { 0:0.45, 1:0.55 },{ 0:0.05, 1:0.95 }]}# l1 lasso l2 ridge
#logreg=LogisticRegression(solver = 'liblinear')
logreg=LogisticRegression()
logreg_cv=GridSearchCV(logreg,grid,cv=3,scoring='f1_micro')
logreg_cv.fit(X_train_vectors_tfidf, y_train)
tuned hpyerparameters :(best parameters)  {'C': 10.0, 'class_weight': {0: 0.45, 1: 0.55}, 'penalty': 'l2', 'solver': 'newton-cg'}
best score : 0.7894909688013136

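You end up with the warning for precision because some of your penalization is too strong for this model. If you check the results, the f1 score is 0 when C = 0.001 and C = 0.01: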

import pandas as pd

res = pd.DataFrame(logreg_cv.cv_results_)
res.iloc[:,res.columns.str.contains("split[0-9]_test_score|params",regex=True)]

                           params  split0_test_score  split1_test_score  split2_test_score
0   {'C': 0.001, 'penalty': 'l2'}           0.000000           0.000000           0.000000
1    {'C': 0.01, 'penalty': 'l2'}           0.000000           0.000000           0.000000
2     {'C': 0.1, 'penalty': 'l2'}           0.973568           0.952607           0.952174
3     {'C': 1.0, 'penalty': 'l2'}           0.863934           0.851064           0.836449
4    {'C': 10.0, 'penalty': 'l2'}           0.811634           0.769547           0.787838
5   {'C': 100.0, 'penalty': 'l2'}           0.789826           0.762162           0.773438
6  {'C': 1000.0, 'penalty': 'l2'}           0.781003           0.750000           0.763871

You can check this:

lr = LogisticRegression(C=0.01).fit(X_train_vectors_tfidf,y_train)
np.unique(lr.predict(X_train_vectors_tfidf))
array([0])
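To see why an all-zero prediction trips up precision but not f1_micro, the metrics can be evaluated directly on such a prediction. This is a small sketch with made-up labels (a hypothetical 55/45 split, not the data from the question): precision for class 1 has a zero denominator because nothing is predicted as class 1, while micro-averaging pools the counts of both classes and, for a single-label problem like this one, comes out equal to accuracy.

```python
import warnings

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score

# Hypothetical labels with roughly the 55/45 split mentioned in the question.
y_true = np.array([0] * 55 + [1] * 45)
# What an over-regularized model does here: it predicts class 0 for every sample.
y_pred = np.zeros_like(y_true)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide UndefinedMetricWarning in this demo
    # Nothing is predicted as class 1, so TP + FP = 0 and precision is undefined.
    precision_pos = precision_score(y_true, y_pred)
    f1_pos = f1_score(y_true, y_pred)

# Micro-averaging pools the counts of both classes, so the denominator is never 0.
f1_micro = f1_score(y_true, y_pred, average="micro")
accuracy = accuracy_score(y_true, y_pred)

print(precision_pos, f1_pos)  # both fall back to 0
print(f1_micro, accuracy)     # both about 0.55
```

So f1_micro never hits the undefined case, but it also no longer focuses on the positive class, which may or may not be what you want from the tuning metric.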

The predicted probabilities drift toward the value given by the intercept:

# expected probability
np.exp(lr.intercept_)/(1+np.exp(lr.intercept_))
array([0.41764462])
lr.predict_proba(X_train_vectors_tfidf)

array([[0.58732636, 0.41267364],
[0.57074279, 0.42925721],
[0.57219143, 0.42780857],
...,
[0.57215605, 0.42784395],
[0.56988186, 0.43011814],
[0.58966184, 0.41033816]])
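If the zeros for the over-regularized candidates are acceptable and only the warning is a nuisance, one option is to pass zero_division through make_scorer, as the warning text itself hints. The sketch below assumes synthetic data in place of the tf-idf matrix and leaves penalty at its default; whether any candidate actually hits the undefined case depends on the data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, precision_score
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the tf-idf features (an assumption, not the original data).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Same metric as scoring="precision", but an undefined precision is scored as 0
# without emitting UndefinedMetricWarning.
precision_scorer = make_scorer(precision_score, zero_division=0)

search = GridSearchCV(LogisticRegression(solver="liblinear"),
                      {"C": np.logspace(-3, 3, 7)}, cv=3, scoring=precision_scorer)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

This only changes how the undefined case is reported. Candidates that never predict class 1 still score 0 and will not be selected.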

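Regarding "returning the predictions for the train data", I think that is the only way. The model is refitted on the whole training set using the best parameters, but the predictions or predicted probabilities are not stored. If you are looking for the values obtained during train/test, you can check cross_val_predict.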
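As a rough illustration of both options, the sketch below again uses synthetic data (an assumption) instead of the tf-idf matrix. It gets in-sample predictions from the refitted search object with predict, which for a binary logistic regression should match thresholding predict_proba at 0.5 apart from exact ties, and out-of-fold predictions with cross_val_predict.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, cross_val_predict

# Synthetic stand-in for the tf-idf features (an assumption, not the original data).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

search = GridSearchCV(LogisticRegression(solver="liblinear"),
                      {"C": np.logspace(-3, 3, 7)}, cv=3, scoring="f1")
search.fit(X, y)

# In-sample predictions from the refitted best model (refit=True is the default).
train_pred = search.predict(X)

# Out-of-fold predictions: each sample is predicted by a model that never saw it.
oof_pred = cross_val_predict(search.best_estimator_, X, y, cv=3)
oof_proba = cross_val_predict(search.best_estimator_, X, y, cv=3,
                              method="predict_proba")[:, 1]

print("in-sample F1  :", f1_score(y, train_pred))
print("out-of-fold F1:", f1_score(y, oof_pred))
```

The out-of-fold F1 is the one that is comparable in spirit to best_score_, although the folds drawn by cross_val_predict are not guaranteed to be the same splits the grid search used unless the same splitter object is passed to both.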
