scikit-learn: how to view feature importance with a pipeline, and how to do logistic + ridge regression



Two questions:

I'm trying to build a model that predicts customer churn. Many of my features suffer from multicollinearity. To address this, I'm trying to penalize the coefficients with Ridge.

More specifically, I'm trying to run a logistic regression but apply a Ridge penalty to the model (not sure whether that even makes sense)…

Questions:

  1. Is choosing a ridge regression classifier sufficient? Or do I need to choose a logistic regression classifier and attach some ridge penalty parameter to it (i.e., LogisticRegression(apply_penalty=Ridge))?

  2. I'm trying to determine feature importance, and from some research it seems I need to use this:

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html

However, I'm confused about how to access this when my model is built around the sklearn.pipeline.make_pipeline function.

I'm just trying to find out which independent variables matter most when predicting my label.

Code below for reference:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge

#prep data
X_prep = df_dummy.drop(columns='CHURN_FLAG')
#set predictor and target variables
X = X_prep #all features except churn_flag
y = df_dummy["CHURN_FLAG"]
#create train /test sets
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.20)

'''
StandardScaler() -> incoming data needs to be standardized before any other transformation is performed on it.
SelectKBest() -> comes from Scikit-learn's feature_selection module. It selects the best features based on a specified scoring function (in this case, f_regression).
The number of features is specified by the parameter k. Even within the selected features, we want to vary the final set of features fed to the model and find what performs best. We can do that with GridSearchCV.
Ridge() -> the estimator that performs the actual regression -- used to reduce the effect of multicollinearity.
GridSearchCV -> in addition to searching over all permutations of the selected parameters, GridSearchCV performs cross-validation on the training data.
'''
#Setting up a pipeline
pipe= make_pipeline(StandardScaler(),SelectKBest(f_regression),Ridge())
#A quick way to get a list of parameters that a pipeline can accept
#pipe.get_params().keys()
#putting together a parameter grid to search over using grid search
params = {
    'selectkbest__k': [1, 2, 3, 4, 5, 6],
    'ridge__fit_intercept': [True, False],
    'ridge__alpha': [0.01, 0.1, 1, 10],
    'ridge__solver': ['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga'],
}
#setting up the grid search
gs=GridSearchCV(pipe,params,n_jobs=-1,cv=5)
#fitting gs to training data
gs.fit(Xtrain, ytrain)
#building a dataframe from cross-validation data
df_cv_scores=pd.DataFrame(gs.cv_results_).sort_values(by='rank_test_score')
#selecting specific columns to create a view
df_cv_scores[['params','split0_test_score', 'split1_test_score', 'split2_test_score',
'split3_test_score', 'split4_test_score', 'mean_test_score',
'std_test_score', 'rank_test_score']].head()
#checking the selected permutation of parameters
gs.best_params_
'''
Finally, we can predict target values for the test set by passing its feature matrix to gs.
The predicted values can be compared with the actual target values to visualize and communicate the performance of the model.
'''
#checking how well the model does on the holdout-set
gs.score(Xtest,ytest)
#plotting predicted churn/active vs actual churn/active
y_preds=gs.predict(Xtest)
plt.scatter(ytest,y_preds)

Is choosing a ridge regression classifier sufficient? Or do I need to choose a logistic regression classifier and attach some parameters for the ridge penalty (i.e., LogisticRegression(apply_penalty=Ridge))?

So the choice between ridge and logistic regression comes down to whether you are doing regression or classification. If you want to predict the amount of churn on some continuous basis, use Ridge; if you want to predict whether someone churned, or how likely they are to churn, use logistic regression.

Sklearn's LogisticRegression applies an l2 penalty by default, which is the same regularization that ridge regression uses. So you should be fine using it if that's the regularization you want :)
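For reference, here is a rough sketch of what the classification version of the pipeline in the question might look like (my assumption, swapping f_regression for f_classif and Ridge for an l2-penalized LogisticRegression; pipe_clf, params_clf and gs_clf are just illustrative names, and Xtrain/ytrain come from the split in the question):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe_clf = make_pipeline(StandardScaler(), SelectKBest(f_classif), LogisticRegression(max_iter=1000))
params_clf = {
    'selectkbest__k': [1, 2, 3, 4, 5, 6],
    # C is the inverse of regularization strength: smaller C = stronger (ridge-style) l2 penalty
    'logisticregression__C': [0.01, 0.1, 1, 10],
}
gs_clf = GridSearchCV(pipe_clf, params_clf, n_jobs=-1, cv=5)
gs_clf.fit(Xtrain, ytrain)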

I'm trying to determine feature importance, and from some research it seems I need to use this.

In general, you can access the elements of a pipeline through its named_steps attribute. So in your case, if you want to access the SelectKBest step, you could do:

pipe.named_steps["SelectKBest"].get_feature_names()

That gives you the feature names, but you still need the values. For those you have to access the coefficients your model has learned. For ridge and logistic regression it looks like this:

pipe.named_steps["logisticregression"].coef_

If you want a more detailed tutorial, I have a blog post about this here.
