In some classification tasks using the mlr package, I need to work with a data.frame similar to this one:
set.seed(pi)
# Dummy data frame
df <- data.frame(
  # Repeated values ID
  ID = sort(sample(c(0:20), 100, replace = TRUE)),
  # Some variables
  X1 = runif(10, 1, 10),
  # Some Label
  Label = sample(c(0,1), 100, replace = TRUE)
)
df
I need to cross-validate the model while keeping observations with the same ID together. I know from the tutorial:
https://mlr-org.github.io/mlr-tutorial/release/html/task/index.html#further-settings
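"We can include a blocking factor in the task. This would indicate that some observations 'belong together' and should not be separated when splitting the data into training and test sets for resampling."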
The question is: how do I include this blocking factor in makeClassifTask?
Unfortunately, I could not find any example.
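Which version of mlr are you using? Blocking has been part of it for a while now. You can pass the blocking factor directly to makeClassifTask.
Here is an example with your data: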
library(mlr)

df$ID = as.factor(df$ID)
# Drop ID from the features so it is only used as the blocking factor
df2 = df
df2$ID = NULL
df2$Label = as.factor(df$Label)
tsk = makeClassifTask(data = df2, target = "Label", blocking = df$ID)
res = resample("classif.rpart", tsk, resampling = cv10)
# Check that blocking worked: no ID may appear in both the
# training set and the test set of the same fold
lapply(1:10, function(i) {
  blocks.training = df$ID[res$pred$instance$train.inds[[i]]]
  blocks.testing = df$ID[res$pred$instance$test.inds[[i]]]
  intersect(blocks.testing, blocks.training)
})
# All entries are empty, so blocking indeed works!
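To make concrete what "belonging together" means, here is a rough base-R sketch of blocked fold assignment that does not involve mlr at all. The helper `assign_block_folds` is a made-up name for illustration and is not how mlr implements blocking internally; it only shows the idea that folds are drawn over IDs rather than over rows.

```r
# Base-R sketch of blocked K-fold assignment (no mlr involved).
# `assign_block_folds` is a hypothetical helper, not part of any package.
assign_block_folds <- function(block, k) {
  block <- as.factor(block)
  lv <- levels(droplevels(block))
  stopifnot(k <= length(lv))
  # Shuffle the block levels, then deal them out to the folds round-robin
  fold_of_level <- setNames(rep_len(seq_len(k), length(lv)), sample(lv))
  # Every row inherits the fold of the block it belongs to
  unname(fold_of_level[as.character(block)])
}

set.seed(1)
id <- sort(sample(0:20, 100, replace = TRUE))
fold <- assign_block_folds(id, k = 5)

# Count in how many different folds each ID occurs
folds_per_id <- tapply(fold, id, function(f) length(unique(f)))
# TRUE when no ID is split across folds
all(folds_per_id == 1)
```

Because the fold is chosen per ID, the folds will generally differ in size when the IDs have different numbers of rows.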
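@jakob-r's answer no longer works. My guess is that something changed in cv10.
A small edit: use "blocking.cv = TRUE" in the resample description.
Full working example: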
library(mlr)

set.seed(pi)
# Dummy data frame
df <- data.frame(
  # Repeated values ID
  ID = sort(sample(c(0:20), 100, replace = TRUE)),
  # Some variables
  X1 = runif(10, 1, 10),
  # Some Label
  Label = sample(c(0,1), 100, replace = TRUE)
)
df
df$ID = as.factor(df$ID)
# Drop ID from the features so it is only used as the blocking factor
df2 = df
df2$ID = NULL
df2$Label = as.factor(df$Label)
# blocking.cv = TRUE makes the CV folds respect the task's blocking factor
resDesc <- makeResampleDesc("CV", iters = 10, blocking.cv = TRUE)
tsk = makeClassifTask(data = df2, target = "Label", blocking = df$ID)
res = resample("classif.rpart", tsk, resampling = resDesc)
# Check that blocking worked: no ID may appear in both the
# training set and the test set of the same fold
lapply(1:10, function(i) {
  blocks.training = df$ID[res$pred$instance$train.inds[[i]]]
  blocks.testing = df$ID[res$pred$instance$test.inds[[i]]]
  intersect(blocks.testing, blocks.training)
})
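If you only want to verify the split and not fit a model, the same check can probably be done on a resample instance. The sketch below assumes an mlr version that accepts `blocking.cv` in `makeResampleDesc`; the names `ids`, `dat`, `rin` and `leaks` are illustrative, and 5 folds are used only to keep the example small.

```r
library(mlr)

set.seed(1)
# Illustrative data: `ids` is the blocking factor, `dat` holds feature and label
ids <- factor(sort(sample(0:20, 100, replace = TRUE)))
dat <- data.frame(
  X1 = runif(100, 1, 10),
  Label = factor(sample(c(0, 1), 100, replace = TRUE))
)

tsk <- makeClassifTask(data = dat, target = "Label", blocking = ids)
rdesc <- makeResampleDesc("CV", iters = 5, blocking.cv = TRUE)
# Draw the train/test indices without training any learner
rin <- makeResampleInstance(rdesc, task = tsk)

# For every fold, count the IDs that occur on both sides of the split
leaks <- vapply(seq_along(rin$train.inds), function(i) {
  train.ids <- as.character(ids[rin$train.inds[[i]]])
  test.ids <- as.character(ids[rin$test.inds[[i]]])
  length(intersect(train.ids, test.ids))
}, integer(1))

# TRUE when no ID is shared between training and test set in any fold
all(leaks == 0)
```

A count above zero in `leaks` would mean that an ID was split between the training and test set of that fold.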