How to compute 95% confidence intervals for predictions from models trained with caret in R?



I have built different regression models with the R package caret. How can I compute the 95% confidence interval of the predictions? I followed the discussion mentioned here, but it does not work.

rm(list = ls())
library(caret)
data("mtcars")
Train_data = mtcars[1:26, -c(8, 9)]
Test_data  = mtcars[27:32, -c(8, 9)]

set.seed(100)
model_pls <- train(
  hp ~ .,
  data = Train_data,
  tuneLength = 5,
  method = "pls",
  metric = "RMSE",
  preProcess = c('center', 'scale'),
  trControl = trainControl(
    method = "repeatedcv",
    number = 5,
    repeats = 3,
    savePredictions = "final"
  )
)
model_rf <- train(
  hp ~ .,
  data = Train_data,
  tuneLength = 5,
  method = "ranger",
  metric = "RMSE",
  preProcess = c('center', 'scale'),
  trControl = trainControl(
    method = "repeatedcv",
    number = 5,
    repeats = 3,
    savePredictions = "final"
  )
)
model_svmr <- train(
  hp ~ .,
  data = Train_data,
  tuneLength = 8,
  method = "svmRadial",
  metric = "RMSE",
  preProcess = c('center', 'scale'),
  trControl = trainControl(
    method = "repeatedcv",
    number = 5,
    repeats = 3
  )
)
# This does not generate confidence interval
PLS.pred = predict(model_pls, subset(Test_data, select = -hp))
RF.pred  = predict(model_rf, subset(Test_data, select = -hp))
SVM.pred = predict(model_svmr, subset(Test_data, select = -hp))

# This is not working
predict(model_pls$finalModel, subset(Test_data, select = -hp), interval = "confidence")
predict(model_rf$finalModel, subset(Test_data, select = -hp), interval = "confidence")
predict(model_svmr$finalModel, subset(Test_data, select = -hp), interval = "confidence")

Following Michael Matta's suggestion, I tried the code below; however, it did not work as expected.

confint(model_pls, level = 0.95)
# Error in UseMethod("vcov"): no applicable method for 'vcov'
predict(model_pls, subset(Test_data, select = -hp), interval = "confidence")
# 64.47807  57.97479 151.59713 130.24356 183.20296  88.50035
# This does not show the CI.

A confidence interval comes either from a known sampling distribution of the statistic in question, or is constructed by resampling. Predictions from an RBF SVM, a random forest, etc. follow no known distribution, which means these models cannot produce confidence intervals out of the box the way a linear model (lm) can.

The way to obtain confidence intervals from such models is to resample the training/testing datasets, retrain, and collect the values you need (e.g., in a for loop). From the collected values you can then estimate a confidence interval for the expected value via the known distribution of the sample mean.


The loop below should work for almost any score you want (accuracy, RMSE, ...; see the caveats below). It is written out here in R for the mtcars example above using PLS, but any caret method can be dropped in:

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

predictionsTrainAll <- c()
predictionsTestAll  <- c()
scoresTrain <- c()
scoresTest  <- c()

for (i in 1:1000) {
  # shuffle the original dataset and draw disjoint training/testing sets
  d <- mtcars[sample(nrow(mtcars)), -c(8, 9)]
  training <- d[1:26, ]
  testing  <- d[27:32, ]

  # train a model on the training data
  model <- train(hp ~ ., data = training, method = "pls",
                 tuneGrid = data.frame(ncomp = 2),
                 trControl = trainControl(method = "none"))

  # make predictions and evaluate any score you like on train and test
  predictionsTrain <- predict(model, training)
  predictionsTest  <- predict(model, testing)
  scoreTrain <- rmse(training$hp, predictionsTrain)
  scoreTest  <- rmse(testing$hp, predictionsTest)

  predictionsTrainAll <- c(predictionsTrainAll, predictionsTrain)
  predictionsTestAll  <- c(predictionsTestAll, predictionsTest)
  scoresTrain <- c(scoresTrain, scoreTrain)
  scoresTest  <- c(scoresTest, scoreTest)
}

Now we can estimate the expected values of scoresTrain and scoresTest. By the central limit theorem we may assume the sample mean is approximately normally distributed (or t-distributed, since the sample here is finite). We can use:

# the scores should be at least roughly normally distributed
# (symmetric around the mean, median close to the mean)
hist(predictionsTrainAll)
hist(predictionsTestAll)
hist(scoresTrain)
hist(scoresTest)
# if the histograms look roughly normal:
t.test(predictionsTrainAll)
t.test(predictionsTestAll)
t.test(scoresTrain)
t.test(scoresTest)

which computes 95% confidence intervals for the expected values (the true means) of the predictions and of whatever score you chose. Beware, though: if a histogram is skewed, the estimate of the mean can be flawed and yield a wrong confidence interval.

Example with a binary classifier: the estimated true mean of the predictions is 0 with 95% CI = [-0.32, 0.32], because the model predicts zeros. However, the predictions can only lie in [0, 1], so the negative part of the CI is meaningless. Such a CI is an artifact of the symmetry implied by the normal/t distribution, and it arises exactly when the histogram of the scores/predictions is not normally distributed — which is why you should check it.
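To make that failure mode concrete, here is a minimal sketch using only base R (the prediction vector is made up for illustration): a near-constant binary classifier's predictions are forced by t.test into a CI that is symmetric around the sample mean, so its lower bound dips below 0 even though predictions live in [0, 1]:

```r
# hypothetical predictions of a binary classifier that almost always outputs 0
preds <- c(rep(0, 48), rep(1, 2))

ci <- t.test(preds)$conf.int
mean(preds)  # 0.04
ci           # symmetric around 0.04; the lower bound is negative here,
             # which is meaningless for values constrained to [0, 1]
```

A bootstrap percentile interval (quantiles of the resampled values) respects such bounds and is a common alternative when the histogram is visibly skewed.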
