r - Manually computed cross-validation gives a different result



Let's generate some data:

set.seed(42)
y <- rnorm(125)
x <- data.frame(runif(125), rexp(125))

I want to run 2-fold cross-validation on it. So:

library(caret)
model <- train(y ~ .,
data = cbind(y, x), method = "lm",
trControl = trainControl(method = "cv", number = 2)
)
model 
Linear Regression 
125 samples
2 predictor
No pre-processing
Resampling: Cross-Validated (2 fold) 
Summary of sample sizes: 63, 62 
Resampling results:
RMSE      Rsquared     MAE      
1.091108  0.002550859  0.8472947
Tuning parameter 'intercept' was held constant at a value of TRUE

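I want to obtain the RMSE value above by hand, to make sure I fully understand cross-validation.

My work so far

As the output above shows, my sample was split into two parts: 62 (fold 1) and 63 (fold 2).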

#Training first model basing on first fold
model_1 <- lm(y[1:63] ~ ., data = x[1:63, ])
#Calculating RMSE for the first model
RMSE_1 <- RMSE(y[64:125], predict(model_1, newdata = x[64:125, ]))
#Training second model basing on second fold
model_2 <- lm(y[64:125] ~ ., data = x[64:125, ])
#Calculating RMSE for the second model
RMSE_2 <- RMSE(y[1:63], predict(model_1, newdata = x[1:63, ]))
mean(c(RMSE_1, RMSE_2))
1.023411

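My question is: why did I get a different RMSE? The difference is too large to be treated as estimation error, so caret must be computing it in some other way. Do you know what I am doing differently?

The logic you used is correct, but two changes are needed:

  1. caret creates its own 2-fold split of the data for training. The folds will not be 1:63 and 64:125; caret generates them at random, depending on the seed.
  2. There is a typo in RMSE_2: it should predict with model_2, not model_1.

Here is the updated code: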
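
To illustrate point 1, here is a minimal base R sketch of the difference between a contiguous split and a shuffled one. It is not caret's actual fold-generation algorithm, and the names `contiguous_id` and `fold_id` are made up for this example. It only shows that, after shuffling, the rows belonging to each fold are scattered across the data instead of being 1:63 and 64:125.

```r
set.seed(42)

# Contiguous split, as in the question: rows 1..63 in fold 1, rows 64..125 in fold 2.
contiguous_id <- rep(1:2, times = c(63, 62))

# Shuffled split: same fold sizes, but rows are assigned to folds at random.
fold_id <- sample(contiguous_id)

# Fold sizes are unchanged by shuffling.
table(fold_id)

# The first few rows of fold 1 are no longer simply 1, 2, 3, ...
head(which(fold_id == 1))
```

Because the fold membership differs, the fitted models and the held-out rows differ too, so the averaged RMSE cannot be expected to agree with caret's unless the same row indices are used.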

# the folds are kept in this part of the output (trial and error to find it haha)
# each element holds the row indexes used for training in one resample
model$control$index
f1 <- model$control$index[[1]]
f2 <- model$control$index[[2]]
# re-do your calculations using the fold indexes, plus the typo fix for RMSE_2
# Training the first model on the rows in f1
model_1 <- lm(y[f1] ~ ., data = x[f1, ])
# Calculating RMSE for the first model on the rows in f2
RMSE_1 <- RMSE(y[f2], predict(model_1, newdata = x[f2, ]))
# Training the second model on the rows in f2
model_2 <- lm(y[f2] ~ ., data = x[f2, ])
# Calculating RMSE for the second model on the rows in f1
RMSE_2 <- RMSE(y[f1], predict(model_2, newdata = x[f1, ]))
# matches now
mean(c(RMSE_1, RMSE_2))
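
The same calculation can be written once for any number of folds. The sketch below is base R only and makes some assumptions: `rmse` and `manual_cv_rmse` are hypothetical helper names, the split is generated here with `sample()` rather than taken from caret, and each resample is assumed to be evaluated on exactly the rows that were not used for training. To compare against caret, `model$control$index` could be passed in place of `train_index`.

```r
# Root mean squared error, written out so the sketch needs no packages.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# train_index: list of integer vectors, one per resample, holding the rows
# used for fitting (the same shape as model$control$index).
manual_cv_rmse <- function(data, train_index) {
  per_fold <- vapply(train_index, function(tr) {
    fit <- lm(y ~ ., data = data[tr, , drop = FALSE])
    held_out <- data[-tr, , drop = FALSE]  # rows not used for fitting
    rmse(held_out$y, predict(fit, newdata = held_out))
  }, numeric(1))
  mean(per_fold)  # average of the per-fold RMSE values
}

set.seed(42)
y <- rnorm(125)
x <- data.frame(runif(125), rexp(125))
dat <- cbind(y, x)

# Illustrative random 2-fold split (an assumption, not caret's own folds).
fold_id <- sample(rep(1:2, length.out = nrow(dat)))
train_index <- lapply(1:2, function(k) which(fold_id != k))

cv_rmse <- manual_cv_rmse(dat, train_index)
cv_rmse
```

Averaging the per-fold RMSE values, rather than pooling all held-out residuals into one RMSE, mirrors the `mean(c(RMSE_1, RMSE_2))` step above.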
