Combining nesting and rolling_origin in Tidymodels in R



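I am trying to train a random forest using rolling_origin from the Tidymodels suite. I want the folds to be exactly the months of the year. Nesting looks like it can do this, but tune_grid can't find the variables when the data is nested. How can I make this work? I have included a reproducible example below.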


suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(tidymodels))
suppressPackageStartupMessages(library(yardstick))
# Create dummy data ====================================================================================================
dates <- seq(from = as.Date("2019-01-01"), to = as.Date("2019-12-31"), by = 'day' )
l <- length(dates)
set.seed(1)
data_set <- data.frame(
  date = dates,
  v1 = rnorm(l),
  v2 = rnorm(l),
  v3 = rnorm(l),
  y = rnorm(l)
)
# Random Forest Model  =================================================================================================
model <-
  parsnip::rand_forest(
    mode = "regression",
    trees = tune()
  ) %>%
  set_engine("ranger")
# grid specification
params <-
  dials::parameters(
    trees()
  )
# Set up grid and model workflow =======================================================================================
grid <-
  dials::grid_max_entropy(
    params,
    size = 2
  )
form <- as.formula(paste("y ~ v1 + v2 + v3"))
model_workflow <-
  workflows::workflow() %>%
  add_model(model) %>%
  add_formula(form)
# Tuning on the normal data set works ====================================================================================================
data_ro_day <- data_set %>%
  rolling_origin(
    initial = 304,
    assess = 30,
    cumulative = TRUE,
    skip = 30
  )
results <- tune_grid(
  model_workflow,
  grid = grid,
  resamples = data_ro_day,
  param_info = params,
  metrics = metric_set(mae, mape, rmse, rsq),
  control = control_grid(verbose = TRUE)
)
# Name the metric argument explicitly; newer versions of tune require it
results %>% show_best(metric = "mape", n = 2)
# Tuning on the nested data set doesn't work =========================================================================================
data_ro_month <- data_set %>%
  mutate(year_month = format(date, "%Y-%m")) %>%
  # nest(-year_month) is deprecated; name the new list-column explicitly
  nest(data = -year_month) %>%
  rolling_origin(
    initial = 10,
    assess = 1,
    cumulative = TRUE
  )
results <- tune_grid(
  model_workflow,
  grid = grid,
  resamples = data_ro_month,
  param_info = params,
  metrics = metric_set(mae, mape, rmse, rsq),
  control = control_grid(verbose = TRUE)
)
results$.notes
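
One way to see why the nested version cannot find the variables is to look at what a single split actually contains. The sketch below rebuilds the nested resamples and inspects the first analysis set; the names nested_folds and first_analysis are made up for this illustration. After nesting, each row is one month and the only columns are year_month and the list-column data, so a formula such as y ~ v1 + v2 + v3 has nothing to match.

```r
library(tidymodels)

# Same dummy data as in the example above
dates <- seq(from = as.Date("2019-01-01"), to = as.Date("2019-12-31"), by = "day")
l <- length(dates)
set.seed(1)
data_set <- tibble(
  date = dates,
  v1 = rnorm(l),
  v2 = rnorm(l),
  v3 = rnorm(l),
  y = rnorm(l)
)

# Nest by month, then resample the 12 monthly rows
nested_folds <- data_set %>%
  mutate(year_month = format(date, "%Y-%m")) %>%
  nest(data = -year_month) %>%
  rolling_origin(initial = 10, assess = 1, cumulative = TRUE)

# The analysis set has one row per month and no y, v1, v2 or v3 columns
first_analysis <- analysis(nested_folds$splits[[1]])
names(first_analysis)
nrow(first_analysis)
```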

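I'm not entirely clear on how you want to divide up your data for tuning, but I'd recommend looking into some of the other resampling functions, such as sliding_window() and especially sliding_period(). They let you create tuning designs in which you fit on certain months of data and then evaluate on another month, sliding along all the months you have available: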

library(tidymodels)
dates <- seq(from = as.Date("2019-01-01"), to = as.Date("2019-12-31"), by = 'day' )
l <- length(dates)
set.seed(1)
data_set <- tibble(
  date = dates,
  v1 = rnorm(l),
  v2 = rnorm(l),
  v3 = rnorm(l),
  y = rnorm(l)
)
month_folds <- data_set %>%
  sliding_period(
    date,
    "month",
    lookback = Inf,
    skip = 4
  )
month_folds
#> # Sliding period resampling 
#> # A tibble: 7 x 2
#>   splits           id    
#>   <list>           <chr> 
#> 1 <split [151/30]> Slice1
#> 2 <split [181/31]> Slice2
#> 3 <split [212/31]> Slice3
#> 4 <split [243/30]> Slice4
#> 5 <split [273/31]> Slice5
#> 6 <split [304/30]> Slice6
#> 7 <split [334/31]> Slice7

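I used skip = 4 here to keep only the slices that have more data for training. Each of these slices trains on several months of data and is evaluated on the next month of new data, with the resampling sliding forward through the dataset. Because I used lookback = Inf, every slice includes all past data, but you can change that.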
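
As a hedged illustration of changing lookback, the sketch below rebuilds the same dummy data and uses lookback = 2. With the default complete = TRUE, this should give a fixed training window of the current month plus the two months before it. The names fixed_folds and first_split are made up for this example, and the last two lines simply check which dates landed in the first slice.

```r
library(tidymodels)

# Same dummy data as above
dates <- seq(from = as.Date("2019-01-01"), to = as.Date("2019-12-31"), by = "day")
l <- length(dates)
set.seed(1)
data_set <- tibble(
  date = dates,
  v1 = rnorm(l),
  v2 = rnorm(l),
  v3 = rnorm(l),
  y = rnorm(l)
)

# Fixed-width window: train on three calendar months, assess on the next one
fixed_folds <- data_set %>%
  sliding_period(
    date,
    "month",
    lookback = 2
  )

# Check the date ranges of the first slice
first_split <- fixed_folds$splits[[1]]
range(analysis(first_split)$date)
range(assessment(first_split)$date)
```

Whether a growing window (lookback = Inf) or a fixed window suits the problem better depends on how quickly older observations stop being informative.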

Once you have set up a resampling approach that suits your domain problem, you can create a model specification and tune it:

rf_spec <-
  rand_forest(
    mode = "regression",
    trees = tune()
  ) %>%
  set_engine("ranger")
rf_wf <-
  workflow() %>%
  add_model(rf_spec) %>%
  add_formula(y ~ v1 + v2 + v3)
tune_grid(rf_wf, resamples = month_folds)
#> # Tuning results
#> # Sliding period resampling 
#> # A tibble: 7 x 4
#>   splits           id     .metrics          .notes          
#>   <list>           <chr>  <list>            <list>          
#> 1 <split [151/30]> Slice1 <tibble [20 × 5]> <tibble [0 × 1]>
#> 2 <split [181/31]> Slice2 <tibble [20 × 5]> <tibble [0 × 1]>
#> 3 <split [212/31]> Slice3 <tibble [20 × 5]> <tibble [0 × 1]>
#> 4 <split [243/30]> Slice4 <tibble [20 × 5]> <tibble [0 × 1]>
#> 5 <split [273/31]> Slice5 <tibble [20 × 5]> <tibble [0 × 1]>
#> 6 <split [304/30]> Slice6 <tibble [20 × 5]> <tibble [0 × 1]>
#> 7 <split [334/31]> Slice7 <tibble [20 × 5]> <tibble [0 × 1]>
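
To tie this back to the question's explicit grid and metric set, here is a minimal sketch of passing both to tune_grid() with the monthly resamples and then ranking the candidates. The two trees values in rf_grid are arbitrary, the names rf_grid, rf_res, best and final_wf are made up for this example, and the choice of rmse for ranking is only an assumption about what matters for the problem.

```r
library(tidymodels)

# Same dummy data and monthly resamples as above
dates <- seq(from = as.Date("2019-01-01"), to = as.Date("2019-12-31"), by = "day")
l <- length(dates)
set.seed(1)
data_set <- tibble(
  date = dates,
  v1 = rnorm(l),
  v2 = rnorm(l),
  v3 = rnorm(l),
  y = rnorm(l)
)
month_folds <- data_set %>%
  sliding_period(date, "month", lookback = Inf, skip = 4)

rf_spec <-
  rand_forest(mode = "regression", trees = tune()) %>%
  set_engine("ranger")
rf_wf <-
  workflow() %>%
  add_model(rf_spec) %>%
  add_formula(y ~ v1 + v2 + v3)

# Hypothetical two-candidate grid; the values are arbitrary
rf_grid <- tibble(trees = c(100, 500))

set.seed(2)
rf_res <- tune_grid(
  rf_wf,
  resamples = month_folds,
  grid = rf_grid,
  metrics = metric_set(mae, rmse, rsq)
)

# Rank the candidates by RMSE averaged over the monthly slices
best <- show_best(rf_res, metric = "rmse", n = 2)
best

# Plug the best candidate back into the workflow
final_wf <- finalize_workflow(rf_wf, select_best(rf_res, metric = "rmse"))
```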

Created on 2020-11-15 by the reprex package (v0.3.0.9001)
