R: How does H2O choose the best variables for a GLM?



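I put my predictor variables into the grid below. As far as I understand, this grid search picks the best variables to use in the model and discards the rest. However, I don't know which algorithm or selection metric it uses to pick them. Can someone tell me how it decides which variables to keep and which to drop?

Code: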

grid.f <- h2o.grid(
  algorithm = "glm",                          # Algorithm type
  grid_id = "grid.f",                         # Id, so the grid is easy to retrieve later
  x = predictors,                             # Predictor columns
  y = response,                               # Target column
  training_frame = data,                      # Training set
  hyper_params = hyper_parameters,            # Alpha values to iterate over
  remove_collinear_columns = TRUE,            # Remove collinear columns
  lambda_search = TRUE,                       # Search for the optimal lambda value
  seed = p.seed,                              # Seed for reproducible results
  keep_cross_validation_predictions = FALSE,  # Do not keep cross-validation predictions
  compute_p_values = FALSE,                   # Do not compute p-values for the coefficients
  family = family,                            # Distribution family
  standardize = TRUE,                         # Standardize numeric columns
  nfolds = p.folds,                           # Number of cross-validation folds
  # max_active_predictors = p.max,            # Cap on the number of active predictors
  fold_assignment = "Modulo",                 # Fold assignment scheme for cross-validation
  link = p.link                               # Link function for the distribution
)

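Even without a grid search, H2O-3's GLM uses L1 regularization (a.k.a. "lasso") to work out which variables it can penalize out of the model. The L1 penalty can shrink a coefficient to exactly zero, and a predictor whose coefficient is zero is effectively dropped. The grid search itself does not select variables; it only trains one model per hyperparameter combination (here, per alpha value).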
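One way to see which predictors the penalty kept is to look at the fitted coefficients. The helper below is a minimal sketch: `split_coefs` is a hypothetical name, not part of H2O, and it assumes the coefficient vector is named the way `h2o.coef()` returns it, with an entry called "Intercept".

```r
# Split a named coefficient vector into predictors the penalty kept (non-zero)
# and predictors it dropped (exactly zero). The intercept is not a predictor,
# so it is excluded from both lists.
# NOTE: split_coefs is an illustrative helper, not an H2O function.
split_coefs <- function(coefs) {
  coefs <- coefs[names(coefs) != "Intercept"]
  list(
    kept    = names(coefs)[coefs != 0],
    dropped = names(coefs)[coefs == 0]
  )
}

# Usage against the grid from the question (needs a running H2O cluster):
# library(h2o)
# grid <- h2o.getGrid("grid.f")               # pass sort_by to rank by a chosen metric
# best <- h2o.getModel(grid@model_ids[[1]])   # first model in the grid listing
# split_coefs(h2o.coef(best))
```

Categorical predictors are expanded into one coefficient per level, so a single column can show up under several names in the output.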

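Elastic net is a blend of the L1 (lasso) and L2 (ridge regression) penalties, controlled by the alpha and lambda parameters: alpha sets the mix between the two, and lambda sets the overall penalty strength.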
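For reference, the penalized objective can be written roughly as follows. This follows the form given in the GLM booklet as I recall it, so treat the exact scaling as an assumption and check the booklet; here $\beta$ are the coefficients, $\beta_0$ the intercept, and $f$ the density of the chosen family.

```latex
\max_{\beta,\,\beta_0}\;
  \sum_{i=1}^{N} \log f\!\left(y_i;\, \beta, \beta_0\right)
  \;-\; \lambda \left( \alpha \,\lVert \beta \rVert_1
  \;+\; \tfrac{1}{2}\,(1 - \alpha)\, \lVert \beta \rVert_2^2 \right)
```

With $\alpha = 1$ this is pure lasso, which can zero out coefficients; with $\alpha = 0$ it is pure ridge, which shrinks coefficients but does not set them to zero. Larger $\lambda$ means a stronger penalty and, when $\alpha > 0$, typically fewer non-zero coefficients.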

The GLM booklet is a good reference for the details:

  • http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/GLMBooklet.pdf
