How to use expand.grid values to run various model hyperparameter combinations for ranger in R

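I have seen many posts on how to use expand.grid to select independent variables for a model and then build a formula from that selection. However, I prepare my input tables beforehand and store them in a list.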

library(ranger)
data(iris)
Input_list <- list(iris1 = iris, iris2 = iris)  # let's assume these are different input tables

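I am interested in trying all possible hyperparameter combinations for a given algorithm (here: random forest using ranger) across my list of input tables. I set up the grid as follows: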

hyper_grid <- expand.grid(
  Input_table = names(Input_list),
  Trees = c(10, 20),
  Importance = c("none", "impurity"),
  Classification = TRUE,
  Repeats = 1:5,
  Target = "Species")

> head(hyper_grid)
  Input_table Trees Importance Classification Repeats  Target
1       iris1    10       none           TRUE       1 Species
2       iris2    10       none           TRUE       1 Species
3       iris1    20       none           TRUE       1 Species
4       iris2    20       none           TRUE       1 Species
5       iris1    10   impurity           TRUE       1 Species
6       iris2    10   impurity           TRUE       1 Species

My question is: what is the best way to pass these values to the model? Currently I am using a for loop:

for (i in 1:nrow(hyper_grid)) {
  RF_train <- ranger(
    dependent.variable.name = hyper_grid[i, "Target"],
    data = Input_list[[hyper_grid[i, "Input_table"]]],  # referring to the named object in the list
    num.trees = hyper_grid[i, "Trees"],
    importance = hyper_grid[i, "Importance"],
    classification = hyper_grid[i, "Classification"])  # otherwise regression is performed
  print(RF_train)
}

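This iterates over each row of the grid. But first, I now have to tell the model whether it is classification or regression. I assume the factor Species gets converted to its numeric factor levels, so regression happens by default. Is there a way to prevent this, and to use e.g. apply for this job? This way of iterating also leads to a messy function call: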

Call:
ranger(dependent.variable.name = hyper_grid[i, "Target"], data = Input_list[[hyper_grid[i,      "Input_table"]]], num.trees = hyper_grid[i, "Trees"], importance = hyper_grid[i,      "Importance"], classification = hyper_grid[i, "Classification"])

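Second: in reality, the model output will obviously not be printed. Instead I will capture the important results right away (mainly RF_train$confusion.matrix) and write them into an extended version of hyper_grid, in the same row as the input parameters. Is this a sensible approach, or is it expensive performance-wise? I ask because if I stored the ranger objects themselves, I would run into memory issues at some point.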

Thanks!

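I think it is cleanest to wrap the training and the extraction of the values you need into a function. The dots (...) are needed for use with the purrr::pmap function below, because pmap passes every grid column as an argument, including ones the function does not use (such as Repeats).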

fit_and_extract_metrics <- function(Target, Input_table, Trees, Importance, Classification, ...) {
  RF_train <- ranger(
    dependent.variable.name = Target,
    data = Input_list[[Input_table]],  # referring to the named object in the list
    num.trees = Trees,
    importance = Importance,
    classification = Classification)  # otherwise regression is performed
  data.frame(Prediction_error = RF_train$prediction.error,
             True_positive = RF_train$confusion.matrix[1])
}
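
On the classification question: as far as I know, ranger picks the forest type from the class of the response column, so a factor response such as Species should yield a classification forest even without the classification argument. A quick check, assuming the ranger package is installed (the object name rf is only for illustration):

```r
library(ranger)

# Species is a factor, and no `classification` argument is passed here
rf <- ranger(dependent.variable.name = "Species", data = iris, num.trees = 10)

rf$treetype          # forest type ranger settled on
class(iris$Species)  # response class that drives that choice
```

If treetype reports a classification forest here, the Classification column in the grid is probably only needed when the response is numeric.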

You can then add the results as a column by mapping over the rows, e.g. with purrr::pmap:

hyper_grid$res <- purrr::pmap(hyper_grid, fit_and_extract_metrics)
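
To see in isolation how pmap treats a data frame, here is a toy sketch. The names grid, add_cols and unused are made up for illustration and the purrr package is assumed to be installed.

```r
library(purrr)

grid <- data.frame(a = c(1, 2), b = c(3, 4), unused = c("x", "y"))

# pmap() calls the function once per row and passes every column
# as a named argument. The dots absorb columns that have no matching
# parameter (here `unused`).
add_cols <- function(a, b, ...) a + b

sums <- pmap_dbl(grid, add_cols)
sums
```

Without the dots in the signature, the same call would stop with an "unused argument" error, because pmap would still try to pass the unused column.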

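Mapped this way, the function is applied row by row and only the small data frame of metrics is kept from each fit, so you should not run into memory issues.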
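
A rough way to convince yourself of the memory point is to compare the size of a fitted forest with the size of the extracted summary. This is only a sketch, assuming ranger is installed; the exact byte counts will differ with the data and the number of trees.

```r
library(ranger)

rf <- ranger(dependent.variable.name = "Species", data = iris, num.trees = 20)

# Keep only a small summary, in the spirit of fit_and_extract_metrics()
metrics <- data.frame(Prediction_error = rf$prediction.error)

# Memory footprints in bytes
object.size(rf)
object.size(metrics)
```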

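The result of purrr::pmap is a list, which means the column res contains a list element for each row. It can be unnested with tidyr::unnest, which spreads the elements of that list across columns of the data frame.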

tidyr::unnest(hyper_grid, res)
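
Since the question asks about the apply family, here is a minimal base R sketch of the same idea using mapply instead of pmap and unnest. The helper fit_one and the reduced grid are assumptions made for illustration, not part of the original code.

```r
library(ranger)

Input_list <- list(iris1 = iris, iris2 = iris)
hyper_grid <- expand.grid(
  Input_table = names(Input_list),
  Trees = c(10, 20),
  Target = "Species",
  stringsAsFactors = FALSE
)

# Hypothetical helper: fit one forest, return a one-row data frame
fit_one <- function(Input_table, Trees, Target) {
  rf <- ranger(
    dependent.variable.name = Target,
    data = Input_list[[Input_table]],
    num.trees = Trees
  )
  data.frame(Prediction_error = rf$prediction.error)
}

# mapply() walks the three columns in parallel, one grid row per call
res <- mapply(fit_one,
              hyper_grid$Input_table, hyper_grid$Trees, hyper_grid$Target,
              SIMPLIFY = FALSE, USE.NAMES = FALSE)

# Stack the one-row results and bind them next to the grid
results <- cbind(hyper_grid, do.call(rbind, res))
results
```

The trade-off is that the columns have to be listed by hand in the mapply call, whereas pmap matches them to parameters by name.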

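I think this approach is quite elegant, but it requires some tidyverse knowledge. If you want to learn more about it, I highly recommend this book. Chapter 25 (Many models) describes an approach similar to the one I am taking here.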
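
One caveat worth checking in your own setup: expand.grid turns character vectors into factors by default, and indexing a list with a factor uses the factor's integer code rather than its label. With Input_table = names(Input_list) the codes happen to line up with the list positions, but a grid over a subset of the names can silently pick the wrong table. A toy sketch with made-up names:

```r
# Toy stand-ins for the input tables
Input_list <- list(first = "table 1", second = "table 2")

# Grid over a subset of the names, once as factor and once as character
grid_fct <- expand.grid(Input_table = "second")
grid_chr <- expand.grid(Input_table = "second", stringsAsFactors = FALSE)

class(grid_fct$Input_table)
class(grid_chr$Input_table)

# Factor index: uses integer code 1, so the first list element is returned
picked_fct <- Input_list[[grid_fct$Input_table[1]]]
# Character index: looks the element up by name
picked_chr <- Input_list[[grid_chr$Input_table[1]]]

picked_fct
picked_chr
```

Passing stringsAsFactors = FALSE to expand.grid, or wrapping the index in as.character(), avoids the problem.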
