Is there a way to do univariate feature selection based on the Wilcoxon test in R?



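I plan to use caret::sbf for univariate feature selection. My input is a data frame with multiple variables (i.e. columns), a list of candidate features, and a label (i.e. a categorical variable). After reading the caret package documentation, I tried to use sbf together with sbfControl for feature selection, but ran into the following error: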

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels

Can anyone tell me how to resolve this error? What is the correct way to use caret::sbf for feature selection? Any ideas?

Reproducible example

Here is a reproducible example in a public gist, which I use as input.

My current attempt

library(caret)
library(e1071)
library(randomForest)
df=read.csv("df.csv", header=TRUE)
sbfCtrl <- sbfControl(method = 'cv', number = 10, returnResamp = 'final', functions = caretFuncs, saveDetails = TRUE)
model <- sbf(form= ventil_status~ .,
             data= df,
             methods='knn',
             trControl=trainControl(method = 'cv', classProbs = TRUE),
             tuneGrid=data.frame(k=1:10),
             sbfControl=sbfControl(functions = sbfCtrl,
                                   methods='repeatedcv', number = 10, repeats = 10))
print(model)
print(model$fit$results)
> model <- sbf(ventil_status~ ., data=df, sizes=c(1,5,10,20),
+              method= 'knn', trControl=trainControl(method = 'cv', classProbs = TRUE),
+              tuneGrid = data.frame(k=1:10),
+              sbfControl=sbfCtrl)
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
contrasts can be applied only to factors with 2 or more levels

I googled this error but still could not get past it. Is there a way to make the code above work? What is the correct way to use caret::sbf for selection by filter?

What I want is an output data frame that contains the selected features with their p-values attached:

newdf <- df[ , -which(names(df) %in% c("subject"))]
p_value_vector <- sapply(names(newdf), function(i)
  tryCatch(
    # wilcox.test drops NA values from both samples itself
    wilcox.test(newdf[newdf$ventil_status %in% "0", i],
                newdf[newdf$ventil_status %in% "1", i])$p.value,
    warning = function(w) return(NA),
    error = function(e) return(NA)
  )
)

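Expected output

I expect the output to be a data frame of the selected features, with the p-value returned by wilcox.test attached to each corresponding feature. Is there a way to do this in R? How do I do feature selection correctly with caret::sbf? Any ideas?

Here is my R session info: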

> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
[1] ggpubr_0.2.5        magrittr_1.5        reshape2_1.4.3     
[4] forcats_0.5.0       purrr_0.3.3         readr_1.3.1        
[7] tibble_2.1.3        tidyverse_1.3.0     stringr_1.4.0      
[10] dplyr_0.8.5         scales_1.1.0        tidyr_1.0.2        
[13] aws.s3_0.3.20       randomForest_4.6-14 e1071_1.7-3        
[16] mlbench_2.1-1       caret_6.0-86        ggplot2_3.3.0      
[19] lattice_0.20-38  

To use sbf, you can start from caret's caretSBF and then replace the score and filter functions with your own definitions:

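library(mlbench)
library(caret)
# start from caret's built-in helper functions for sbf
knnSBF = caretSBF
knnSBF$summary <- twoClassSummary
# score each predictor by its Wilcoxon test p-value
knnSBF$score <- function(x, y) {
  wilcox.test(x ~ y)$p.value
}
# keep predictors whose p-value is at most 0.05
knnSBF$filter <- function(score, x, y) {
  score <= 0.05
}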
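
The score and filter functions can be checked on their own before they are handed to sbf. The following is only a sketch in base R on made-up data: `toy_x`, `toy_y`, `score_fun` and `filter_fun` are hypothetical names that are not part of the example above, and the outcome is assumed to be a two-level factor.

```r
# Hypothetical toy data: one predictor, two classes with shifted means
set.seed(1)
toy_y <- factor(rep(c("A", "B"), each = 30))
toy_x <- c(rnorm(30, mean = 0), rnorm(30, mean = 2))

# Same definitions as the score and filter functions above
score_fun <- function(x, y) wilcox.test(x ~ y)$p.value
filter_fun <- function(score, x, y) score <= 0.05

p <- score_fun(toy_x, toy_y)        # p-value for this single predictor
keep <- filter_fun(p, toy_x, toy_y) # TRUE if the predictor passes the filter
print(p)
print(keep)
```

sbf calls the score function once per predictor column and passes the resulting scores to the filter, so both functions only ever need to handle one column at a time.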

Then define the training parameters and the sbf parameters:

sbfCtrl <- sbfControl(method = "cv", number = 3,
                      functions = knnSBF, saveDetails = TRUE)
trn_grid <- expand.grid(k = c(2, 6, 10))
trCtrl <- trainControl(method = "cv", number = 3,
                       classProbs = TRUE, verboseIter = TRUE)

Then run the training:

data(Sonar)
y = Sonar$Class
x = Sonar[,-ncol(Sonar)]
set.seed(111)
model1 <- sbf(x, y, trControl = trCtrl,
              sbfControl = sbfCtrl,
              method = "knn",
              tuneGrid = trn_grid)
model1$variables
$selectedVars
[1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V8"  "V9"  "V10" "V11" "V12" "V13"
[13] "V14" "V20" "V21" "V22" "V36" "V37" "V42" "V43" "V44" "V45" "V46" "V47"
[25] "V48" "V49" "V50" "V51" "V52" "V54" "V58"
$selectedVars
[1] "V4"  "V5"  "V6"  "V9"  "V10" "V11" "V12" "V13" "V14" "V20" "V21" "V22"
[13] "V28" "V31" "V34" "V35" "V36" "V37" "V43" "V44" "V45" "V46" "V47" "V48"
[25] "V49" "V51" "V52"
$selectedVars
[1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10" "V11" "V12"
[13] "V13" "V14" "V21" "V22" "V23" "V34" "V35" "V36" "V37" "V43" "V44" "V45"
[25] "V46" "V47" "V48" "V49" "V50" "V51" "V52" "V53" "V56" "V58"

I don't think sbf gives you the p-values, although I could be wrong. To compute the p-values for the example above:

p_value_vector <- apply(x,2,function(i)wilcox.test(i~y)$p.value)
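
To get closer to the output asked for in the question (a data frame of selected features with their p-values attached), one option is to compute the p-values outside sbf and subset on them. This is a minimal base-R sketch on synthetic data, not the asker's data: `toy`, `pvals`, `result` and `selected_df` are hypothetical names, and the 0.05 cutoff simply mirrors the filter defined earlier. These p-values are computed on the full data, whereas sbf re-applies the filter inside each resample, so the selected sets need not match exactly.

```r
# Hypothetical synthetic data: three features and a two-level label
set.seed(42)
n <- 40
toy <- data.frame(
  f1 = c(rnorm(n, 0), rnorm(n, 2)),    # strongly separated between classes
  f2 = rnorm(2 * n),                   # pure noise
  f3 = c(rnorm(n, 0), rnorm(n, 1.5)),  # separated between classes
  label = factor(rep(c("0", "1"), each = n))
)

features <- setdiff(names(toy), "label")

# One Wilcoxon rank-sum p-value per feature; NA if the test fails
pvals <- sapply(features, function(f) {
  x <- toy[[f]]
  tryCatch(
    wilcox.test(x[toy$label == "0"], x[toy$label == "1"])$p.value,
    error = function(e) NA_real_
  )
})

# Table of features that pass the cutoff, with p-values attached
result <- data.frame(feature = features, p_value = pvals,
                     row.names = NULL, stringsAsFactors = FALSE)
result <- result[!is.na(result$p_value) & result$p_value <= 0.05, ]
result <- result[order(result$p_value), ]
print(result)

# Original data restricted to the selected features plus the label
selected_df <- toy[, c(result$feature, "label"), drop = FALSE]
```

Keeping the p-value table separate from the subsetted data makes it easy to apply a multiple-testing correction such as `p.adjust` before filtering, if that is wanted.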
