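I pass my predictor variables to the grid below. As far as I understand, this grid search selects the best variables to use in the model and discards the rest. However, I don't know which algorithm or selection metric it uses to pick them. Can someone tell me how it decides which variables to keep and which to discard?
Function: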
grid.f <- h2o.grid(algorithm = "glm", # Algorithm type
                   grid_id = "grid.f", # Id, so information on iterations is easier to retrieve later
                   x = predictors, # Predictive features
                   y = response, # Target variable
                   training_frame = data, # Training set
                   hyper_params = hyper_parameters, # Alpha values to iterate over
                   remove_collinear_columns = TRUE, # Remove collinear columns
                   lambda_search = TRUE, # Search for the optimal lambda value
                   seed = p.seed, # Fixed seed for reproducible results
                   keep_cross_validation_predictions = FALSE, # Do not keep cross-validation predictions
                   compute_p_values = FALSE, # Do not compute p-values for the coefficients
                   family = family, # Distribution family
                   standardize = TRUE, # Standardize continuous variables
                   nfolds = p.folds, # Number of cross-validation folds
                   #max_active_predictors = p.max, # Cap on the number of active features
                   fold_assignment = "Modulo", # Fold assignment scheme for cross-validation
                   link = p.link) # Link function for the distribution
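Even without a grid search, H2O-3's GLM uses L1 regularization (also known as "lasso") to work out which variables it can penalize out of the model. The L1 penalty can shrink a coefficient to exactly zero, and a variable with a zero coefficient is effectively dropped. The grid search does not select variables itself; it only fits one model per alpha value you supply.
Elastic net is a mix of L1 (lasso) and L2 (ridge regression), controlled by the alpha and lambda parameters: alpha sets the balance between the two penalties, and lambda sets the overall penalty strength.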
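For reference, the penalized objective can be written roughly as follows. This is a sketch in my own notation, where $\beta$ is the coefficient vector, $\beta_0$ the intercept, $N$ the number of observations, and $\ell$ the log-likelihood of the chosen family; see the booklet for the exact form H2O uses.

$$
\min_{\beta,\,\beta_0}\; -\frac{1}{N}\,\ell(\beta, \beta_0)
\;+\; \lambda\left(\alpha\,\lVert\beta\rVert_1 \;+\; \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\right)
$$

With $\alpha = 1$ this is pure lasso, and with $\alpha = 0$ it is pure ridge. Only the $\lVert\beta\rVert_1$ term can push coefficients to exactly zero, so a larger $\alpha$ or a larger $\lambda$ generally leaves fewer variables in the model.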
The GLM booklet is a good reference for the details:
- http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/GLMBooklet.pdf
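To see which variables ended up penalized out, you can look at the fitted coefficients of a model from the grid. The sketch below is only an illustration: `zeroed_terms` and `zeroed_terms_in_best_model` are helper names I made up, and it assumes the grid id "grid.f" from the question and that the first model in the grid's default ordering is the one you care about.

```r
# Hypothetical helper: names of the terms whose coefficient is exactly zero.
# `coefs` is a named numeric vector such as the one returned by h2o.coef().
zeroed_terms <- function(coefs) {
  coefs <- coefs[names(coefs) != "Intercept"]  # the intercept is not a predictor
  names(coefs)[coefs == 0]
}

# Hypothetical wrapper: pull the first model out of the grid and list its
# zeroed terms. Requires the h2o package and a running H2O cluster.
zeroed_terms_in_best_model <- function(grid_id = "grid.f") {
  grid <- h2o::h2o.getGrid(grid_id)               # models in the grid's default sort order
  best <- h2o::h2o.getModel(grid@model_ids[[1]])  # first model in that ordering
  zeroed_terms(h2o::h2o.coef(best))
}
```

Calling `zeroed_terms_in_best_model("grid.f")` after the grid has finished should list the dropped terms. Categorical predictors are expanded into one coefficient per level, so the names you get back are terms rather than original column names.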