R caret: unable to change the performance metric when using the rfe function



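I am trying to perform recursive feature elimination with the rfe function, but I am having trouble changing the performance measure so that it reports ROC: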

newFunc <- caretFuncs
newFunc$summary <- twoClassSummary 
ctrl <- rfeControl(functions = newFunc, 
                   method = 'cv',
                   returnResamp = TRUE,
                   number = 2,
                   verbose = TRUE)
profiler <- rfe(predictors, response, 
                sizes = c(1), 
                method = 'nnet',
                tuneGrid = expand.grid(size=c(4), decay=c(0.1)), 
                maxit = 20,
                metric = 'ROC', 
                rfeControl = ctrl) 

Running this code gives me the following error:

Error in { : task 1 failed - "undefined columns selected"

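If I drop the custom newFunc, set the functions argument of rfeControl to caretFuncs, and remove the metric argument from rfe, the model runs fine. This makes me think the problem lies in the summary function.

caretFuncs$summary: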

function (data, lev = NULL, model = NULL) 
{
    if (is.character(data$obs)) 
        data$obs <- factor(data$obs, levels = lev)
    postResample(data[, "pred"], data[, "obs"])
}

twoClassSummary:

function (data, lev = NULL, model = NULL) 
{
    lvls <- levels(data$obs)
    if (length(lvls) > 2) 
        stop(paste("Your outcome has", length(lvls), "levels. The twoClassSummary() function isn't appropriate."))
    requireNamespaceQuietStop("ModelMetrics")
    if (!all(levels(data[, "pred"]) == lvls)) 
        stop("levels of observed and predicted data do not match")
    data$y = as.numeric(data$obs == lvls[2])
    rocAUC <- ModelMetrics::auc(ifelse(data$obs == lev[2], 0, 
        1), data[, lvls[1]])
    out <- c(rocAUC, sensitivity(data[, "pred"], data[, "obs"], 
        lev[1]), specificity(data[, "pred"], data[, "obs"], lev[2]))
    names(out) <- c("ROC", "Sens", "Spec")
    out
}

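The outputs of postResample and twoClassSummary have the same structure, so I am a bit lost here. Am I doing something inherently wrong, or is this a bug that I should report to the developers?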
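One plausible reading of the error, judging only from the twoClassSummary source shown above: it indexes `data[, lvls[1]]`, a column named after the first class level, which is only present when class probabilities are part of the resampled predictions. The sketch below reproduces the message in base R with a hypothetical data frame `d` that holds only pred and obs. Whether this is exactly what happens inside rfe is an assumption.

```r
# Hypothetical resample data with only the two columns postResample needs.
d <- data.frame(
  pred = factor(c("M", "R", "M"), levels = c("M", "R")),
  obs  = factor(c("M", "R", "R"), levels = c("M", "R"))
)
lvls <- levels(d$obs)

# postResample-style access works: both columns exist.
acc <- mean(d[, "pred"] == d[, "obs"])

# twoClassSummary-style access asks for a probability column named "M",
# which this data frame does not have.
msg <- tryCatch(
  d[, lvls[1]],
  error = function(e) conditionMessage(e)
)
print(msg)
```

If that reading is right, any summary function that needs class probabilities will fail in the same way whenever the prediction step returns only the predicted class.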


I am actually interested in obtaining logLoss, so I could write my own function:

logLoss = function(data, lev = NULL, model = NULL) {
  -1*mean(log(data[, 'pred'][model.matrix(~ as.numeric(data[, 'obs'], levels = lev) + 0) - data[, 'pred'] > 0]))
}

However, I am not sure how to convert my [yes, no] factor levels to [0, 1] correctly.

First, here is a working LogLoss function that can be used with caret:

LogLoss <- function (data, lev = NULL, model = NULL) 
{ 
  obs <- data[, "obs"]
  cls <- levels(obs) #find class names
  probs <- data[, cls[2]] #use second class name
  probs <- pmax(pmin(as.numeric(probs), 1 - 1e-15), 1e-15) #bound probability
  logPreds <- log(probs)        
  log1Preds <- log(1 - probs)
  real <- (as.numeric(data$obs) - 1)
  out <- c(mean(real * logPreds + (1 - real) * log1Preds)) * -1
  names(out) <- c("LogLoss")
  out
}
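As a quick sanity check, the function can be run on a tiny hand-made data frame and compared with the value computed by hand. The sketch below repeats a condensed copy of the same logic so that it runs on its own; the data frame `toy` and its probabilities are made up for illustration.

```r
# Condensed copy of the LogLoss logic above, repeated so the block is self-contained.
LogLoss <- function(data, lev = NULL, model = NULL) {
  cls <- levels(data[, "obs"])                       # class names
  probs <- pmax(pmin(as.numeric(data[, cls[2]]), 1 - 1e-15), 1e-15)
  real <- as.numeric(data$obs) - 1                   # 0 for first level, 1 for second
  out <- -mean(real * log(probs) + (1 - real) * log(1 - probs))
  names(out) <- "LogLoss"
  out
}

# Made-up predictions: one column per class level, as a summary function expects.
toy <- data.frame(
  obs  = factor(c("yes", "no"), levels = c("no", "yes")),
  pred = factor(c("yes", "no"), levels = c("no", "yes")),
  no   = c(0.1, 0.8),
  yes  = c(0.9, 0.2)
)

res <- LogLoss(toy, lev = levels(toy$obs))
by_hand <- -(log(0.9) + log(0.8)) / 2                # same two terms written out
print(res)
print(all.equal(unname(res), by_hand))
```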

To answer the question of how to convert the [yes, no] factor levels to [0, 1]:

real <- (as.numeric(data$obs) - 1)
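This conversion depends on the order of the factor levels: the first level maps to 0 and the second to 1. A small base R sketch with made-up values shows the effect of the level order, along with an explicit comparison that does not depend on it.

```r
# Default factor levels are sorted alphabetically: "no" comes before "yes".
obs <- factor(c("yes", "no", "no", "yes"))
real <- as.numeric(obs) - 1            # "no" -> 0, "yes" -> 1

# With the levels reversed, the same trick flips the coding.
obs_rev <- factor(c("yes", "no", "no", "yes"), levels = c("yes", "no"))
real_rev <- as.numeric(obs_rev) - 1    # "yes" -> 0, "no" -> 1

# Comparing against the positive class by name is independent of level order.
real_explicit <- as.numeric(obs_rev == "yes")

print(real)
print(real_rev)
print(real_explicit)
```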

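To make rfe work, you can use rfFuncs instead of caretFuncs. Example:

library(caret)
library(mlbench)  # provides the Sonar data set
data(Sonar)
rfFuncs$summary <- twoClassSummary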
ctrl <- rfeControl(functions = rfFuncs, 
                   method = 'cv',
                   returnResamp = TRUE,
                   number = 2,
                   verbose = TRUE)
profiler <- rfe(Sonar[,1:60], Sonar$Class, 
                sizes = c(1, 5, 20, 40, 60), 
                method = 'nnet',
                tuneGrid = expand.grid(size=c(4), decay=c(0.1)), 
                maxit = 20,
                metric = 'ROC', 
                rfeControl = ctrl)
profiler$results
  Variables       ROC      Sens      Spec      ROCSD      SensSD      SpecSD
1         1 0.6460027 0.6387987 0.5155187 0.08735968 0.132008571 0.007516016
2         5 0.7563971 0.6847403 0.7013180 0.03751483 0.008724045 0.039383924
3        20 0.8633511 0.8462662 0.7017432 0.08460677 0.091143309 0.097708207
4        40 0.8841540 0.8642857 0.7429847 0.08096697 0.090913729 0.098309489
5        60 0.8945351 0.9004870 0.7431973 0.05707867 0.064971175 0.127471631

Or with the LogLoss function I provided:

rfFuncs$summary <- LogLoss
ctrl <- rfeControl(functions = rfFuncs, 
                   method = 'cv',
                   returnResamp = TRUE,
                   number = 2,
                   verbose = TRUE)
profiler <- rfe(Sonar[,1:60], Sonar$Class, 
                sizes = c(1, 5, 20, 40, 60), 
                method = 'nnet',
                tuneGrid = expand.grid(size=c(4), decay=c(0.1)), 
                maxit = 20,
                metric = 'LogLoss', 
                rfeControl = ctrl,
                maximize = FALSE) # edited after the answer by Дмитрий Пасько
profiler$results
  Variables   LogLoss   LogLossSD
1         1 1.8237372 1.030120134
2         5 0.5548774 0.128704686
3        20 0.4226522 0.021547998
4        40 0.4167819 0.013587892
5        60 0.4328718 0.008000892

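EDIT: Дмитрий Пасько raises a valid concern in his answer: LogLoss should be minimized. One way to achieve this is to supply the logical argument maximize, which tells caret whether the metric should be minimized or maximized.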
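To see what the direction of the metric changes, the LogLoss results printed above can be fed to a small selection helper. `pick_size` is a hypothetical function written for this illustration, not caret's own selection code.

```r
# LogLoss results copied from the profiler$results output above.
results <- data.frame(
  Variables = c(1, 5, 20, 40, 60),
  LogLoss   = c(1.8237372, 0.5548774, 0.4226522, 0.4167819, 0.4328718)
)

# Hypothetical helper: return the subset size with the best metric value.
pick_size <- function(x, metric, maximize) {
  best <- if (maximize) which.max(x[[metric]]) else which.min(x[[metric]])
  x$Variables[best]
}

size_min <- pick_size(results, "LogLoss", maximize = FALSE)
size_max <- pick_size(results, "LogLoss", maximize = TRUE)
print(size_min)  # 40, the lowest loss
print(size_max)  # 1, the highest loss, which is the worst model here
```

Treating LogLoss as a metric to maximize would pick the single-variable model, which has the highest loss in this table.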

But you should minimize LogLoss, so use this code (an example with logistic regression: https://www.kaggle.com/demetrypascal/rfe-lfe-logreg-with-with-pca-and-pca-and-pca--feature--重要性):

LogLoss <- function (data, lev = NULL, model = NULL) 
{ 
  obs <- data[, "obs"]
  cls <- levels(obs) #find class names
  probs <- data[, cls[2]] #use second class name
  probs <- pmax(pmin(as.numeric(probs), 1 - 1e-15), 1e-15) #bound probability
  logPreds <- log(probs)        
  log1Preds <- log(1 - probs)
  real <- (as.numeric(data$obs) - 1)
  out <- c(mean(real * logPreds + (1 - real) * log1Preds)) * -1
  names(out) <- c("LogLossNegative")
  -out
}
lrFuncs$summary <- LogLoss
rfec = rfeControl(method = "cv",
                     number = 2,
                     functions = lrFuncs)
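The function above returns the negated loss under the name LogLossNegative, so a selection step that maximizes the metric ends up minimizing the loss. Presumably the metric name given to rfe would then have to be 'LogLossNegative'; that call is not shown here, so treat it as an assumption. A base R sketch of the equivalence, using the LogLoss values from the earlier results table:

```r
# Positive LogLoss values per subset size, copied from the earlier results.
sizes   <- c(1, 5, 20, 40, 60)
logloss <- c(1.8237372, 0.5548774, 0.4226522, 0.4167819, 0.4328718)

# What a summary function returning the negated loss would report.
logloss_negative <- -logloss

# Maximizing the negated loss selects the same size as minimizing the loss.
best_by_max <- sizes[which.max(logloss_negative)]
best_by_min <- sizes[which.min(logloss)]
print(best_by_max)
print(best_by_min)
```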
