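After subsampling during resampling, as shown here https://topepo.github.io/caret/subsampling-for-class-imbalances.html#subsampling-during-resampling my question is simply how to extract the actual data set this process produces when the caret method = "rf" and the sampling method is "smote".
For example, with method = "glm" the data can be extracted with model$finalModel$data, and with method = "rpart" it can be extracted similarly with model$finalModel$call$data.
Using subsampling during resampling with method = "rpart", the SMOTE data set can be recovered as follows: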
library(caret)
library(DMwR)
data("GermanCredit")
set.seed(122)
index1<-createDataPartition(GermanCredit$Class, p=.7, list = FALSE)
training<-GermanCredit[index1, ]
#testing<-GermanCredit[-index1,]
colnames(training)
metric <- "ROC"
ctrl1 <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5,
  search = "random",
  classProbs = TRUE,        # note class probabilities included
  savePredictions = TRUE,   # or "final"
  returnResamp = "final",
  allowParallel = TRUE,
  summaryFunction = twoClassSummary,
  sampling = "smote")
set.seed(1)
mod_fit <- train(Class ~ Age +
                   ForeignWorker +
                   Property.RealEstate +
                   Housing.Own +
                   CreditHistory.Critical,
                 data = training, method = "rpart",
                 metric = metric,
                 trControl = ctrl1)
mod_fit # ROC 0.5951215
dat_smote <- mod_fit$finalModel$call$data
table(dat_smote$.outcome)
#  Bad Good
#  630  840
head(dat_smote)
# Age ForeignWorker Property.RealEstate Housing.Own CreditHistory.Critical .outcome
#  40             1                   0           1                      1     Good
#  29             1                   0           0                      0     Good
#  37             1                   1           0                      1     Good
#  47             1                   0           0                      0     Good
#  53             1                   0           1                      0     Good
#  29             1                   0           1                      0     Good
I would just like to be able to do the same data set extraction when method = "rf". The code might look something like this:
dat <- mod_fit$trainingData[mod_fit$trainingData == mod_fit$finalModel$x, ]
I think the only way is to write a custom model that saves the data object in the fit module (although that is rather unsatisfying).
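A minimal sketch of that custom-model idea, reusing training and ctrl1 from the code above. It copies caret's built-in "rf" model definition with getModelInfo() and replaces its fit function with one that also stores the data it receives. This assumes the x and y handed to the final fit are already the subsampled data, as the rpart result above suggests. The names rf_keep and smote_data are made up for illustration, and this has not been checked against every caret version.

```r
library(caret)
library(randomForest)

# Copy caret's built-in random forest model definition
rf_keep <- getModelInfo("rf", regex = FALSE)[[1]]

# Replace the fit function: same randomForest call, but on the final
# fit (last = TRUE) also attach the data the model was trained on
rf_keep$fit <- function(x, y, wts, param, lev, last, classProbs, ...) {
  fit <- randomForest::randomForest(x, y,
                                    mtry = min(param$mtry, ncol(x)),
                                    ...)
  if (isTRUE(last)) {
    # smote_data is a hypothetical slot name, not part of randomForest
    fit$smote_data <- data.frame(x, .outcome = y)
  }
  fit
}

set.seed(1)
mod_rf <- train(Class ~ Age +
                  ForeignWorker +
                  Property.RealEstate +
                  Housing.Own +
                  CreditHistory.Critical,
                data = training, method = rf_keep,
                metric = "ROC",
                trControl = ctrl1)

# Pull the stored data back out of the final model
dat_smote_rf <- mod_rf$finalModel$smote_data
table(dat_smote_rf$.outcome)
```

Storing the data only when last is TRUE should avoid keeping a copy for every resample.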