R - caret train() "Error: Stopping" with "Not all variable names used in object found in newdata"



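I am trying to build a simple Naive Bayes classifier for the mushroom data. I want to use all of the variables as categorical predictors to predict whether a mushroom is edible.

I am using the caret package.

Here is my complete code: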

##################################################################################
# Prepare R and R Studio environment
##################################################################################
# Clear the R studio console
cat("\014")
# Remove objects from environment
rm(list = ls())
# Install and load packages if necessary
if (!require(tidyverse)) {
  install.packages("tidyverse")
  library(tidyverse)
}
if (!require(caret)) {
  install.packages("caret")
  library(caret)
}
if (!require(klaR)) {
  install.packages("klaR")
  library(klaR)
}
#################################
mushrooms <- read.csv("agaricus-lepiota.data", stringsAsFactors = TRUE, header = FALSE)
na.omit(mushrooms)
names(mushrooms) <- c("edibility", "capShape", "capSurface", "cap-color", "bruises", "odor", "gill-attachment", "gill-spacing", "gill-size", "gill-color", "stalk-shape", "stalk-root", "stalk-surface-above-ring", "stalk-surface-below-ring", "stalk-color-above-ring", "stalk-color-below-ring", "veil-type", "veil-color", "ring-number", "ring-type", "spore-print-color", "population", "habitat")
# convert bruises to a logical variable
mushrooms$bruises <- mushrooms$bruises == 't'
set.seed(1234)
split <- createDataPartition(mushrooms$edibility, p = 0.8, list = FALSE)
train <- mushrooms[split, ]
test <- mushrooms[-split, ]
predictors <- names(train)[2:20] #Create response and predictor data
x <- train[,predictors] #predictors
y <- train$edibility #response
train_control <- trainControl(method = "cv", number = 1) # Set up 1 fold cross validation
edibility_mod1 <- train( #train the model
  x = x,
  y = y,
  method = "nb",
  trControl = train_control
)

When I execute the train() function, I get the following output:

Something is wrong; all the Accuracy metric values are missing:
Accuracy       Kappa    
Min.   : NA   Min.   : NA  
1st Qu.: NA   1st Qu.: NA  
Median : NA   Median : NA  
Mean   :NaN   Mean   :NaN  
3rd Qu.: NA   3rd Qu.: NA  
Max.   : NA   Max.   : NA  
NA's   :2     NA's   :2    
Error: Stopping
In addition: Warning messages:
1: predictions failed for Fold1: usekernel= TRUE, fL=0, adjust=1 Error in predict.NaiveBayes(modelFit, newdata) : 
Not all variable names used in object found in newdata

2: model fit failed for Fold1: usekernel=FALSE, fL=0, adjust=1 Error in x[, 2] : subscript out of bounds

3: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
There were missing values in resampled performance measures.

x and y after the script has run:

> str(x)
'data.frame':   6500 obs. of  19 variables:
$ capShape                : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
$ capSurface              : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
$ cap-color               : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
$ bruises                 : logi  TRUE TRUE TRUE TRUE FALSE TRUE ...
$ odor                    : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
$ gill-attachment         : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
$ gill-spacing            : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
$ gill-size               : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
$ gill-color              : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
$ stalk-shape             : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
$ stalk-root              : Factor w/ 5 levels "?","b","c","e",..: 4 3 3 4 4 3 3 3 4 3 ...
$ stalk-surface-above-ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
$ stalk-surface-below-ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
$ stalk-color-above-ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
$ stalk-color-below-ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
$ veil-type               : Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ...
$ veil-color              : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
$ ring-number             : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
$ ring-type               : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...

> str(y)
Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...

My environment is:

> R.version
_                           
platform       x86_64-apple-darwin17.0     
arch           x86_64                      
os             darwin17.0                  
system         x86_64, darwin17.0          
status                                     
major          4                           
minor          0.3                         
year           2020                        
month          10                          
day            10                          
svn rev        79318                       
language       R                           
version.string R version 4.0.3 (2020-10-10)
nickname       Bunny-Wunnies Freak Out     
> RStudio.Version()
$citation
To cite RStudio in publications use:
RStudio Team (2020). RStudio: Integrated Development Environment for R. RStudio, PBC, Boston, MA URL http://www.rstudio.com/.
A BibTeX entry for LaTeX users is
@Manual{,
title = {RStudio: Integrated Development Environment for R},
author = {{RStudio Team}},
organization = {RStudio, PBC},
address = {Boston, MA},
year = {2020},
url = {http://www.rstudio.com/},
}

$mode
[1] "desktop"
$version
[1] ‘1.3.1093’
$release_name
[1] "Apricot Nasturtium"

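What you are trying to do is a bit tricky. Most naive Bayes implementations, or at least the one you are using (from klaR, which is derived from e1071), use a normal distribution. You can see this under Details on the naiveBayes help page in e1071:

The standard naive Bayes classifier (at least this implementation) assumes independence of the predictor variables, and Gaussian distribution (given the target class) of metric predictors. For attributes with missing values, the corresponding table entries are omitted for prediction.

Your predictors are categorical, so this might be a problem. You can try setting usekernel=TRUE and adjust=1 to force it towards normal, and avoid usekernel=FALSE, which throws the error.

Before that, we remove the columns that have only 1 level and sort out the column names. In this case it is also easier to use the formula and avoid making dummy variables: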

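# Work on a copy of the training set
df = train
# veil-type has a single level ("p"), so it carries no information; drop it
levels(df[["veil-type"]])
# [1] "p"
df[["veil-type"]] = NULL
# Replace hyphens so the column names are valid in a formula
colnames(df) = gsub("-", "_", colnames(df))
# Hold usekernel and adjust fixed; tune only the Laplace correction fL
Grid = expand.grid(usekernel = TRUE, adjust = 1, fL = c(0.2, 0.5, 0.8))
mod1 <- train(edibility ~ ., data = df,
              method = "nb", trControl = trainControl(method = "cv", number = 5),
              tuneGrid = Grid
)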
mod1
Naive Bayes 
6500 samples
21 predictor
2 classes: 'e', 'p' 
No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 5200, 5200, 5200, 5200, 5200 
Resampling results across tuning parameters:
fL   Accuracy   Kappa    
0.2  0.9243077  0.8478624
0.5  0.9243077  0.8478624
0.8  0.9243077  0.8478624
Tuning parameter 'usekernel' was held constant at a value of TRUE
Tuning parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were fL = 0.2, usekernel = TRUE and
adjust = 1.
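
The column cleanup above handles veil-type by name. As a minimal sketch, the same idea can be applied to every column at once using base R only. The data frame toy and its values are hypothetical stand-ins for the mushroom training set, not the real data:

```r
# Hypothetical toy data standing in for the mushroom training set
toy <- data.frame(
  edibility   = factor(c("e", "p", "e", "p")),
  `cap-color` = factor(c("n", "y", "w", "n")),
  `veil-type` = factor(c("p", "p", "p", "p")),
  check.names = FALSE  # keep the hyphenated names as they are
)

# Count the distinct observed values in each column
n_unique <- vapply(toy, function(col) length(unique(col)), integer(1))

# Keep only the columns that actually vary
cleaned <- toy[, n_unique > 1, drop = FALSE]

# Replace hyphens so the names are syntactically valid in a formula
colnames(cleaned) <- gsub("-", "_", colnames(cleaned), fixed = TRUE)

colnames(cleaned)
```

Counting observed values rather than calling nlevels() also catches factors whose extra levels no longer occur in the training split.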

Instead of using "nb" in the train function, use "naive_bayes". It worked for me even with a severe class imbalance problem.
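
A minimal sketch of that suggestion, assuming the caret and naivebayes packages are installed (caret's "naive_bayes" method is backed by the naivebayes package and tunes laplace, usekernel and adjust). The toy data frame and the grid values are hypothetical choices for illustration, not the mushroom data or recommended settings:

```r
library(caret)  # the "naive_bayes" method also needs the naivebayes package

set.seed(1234)
n <- 200
# Hypothetical categorical data; the columns are independent random draws
toy <- data.frame(
  edibility = factor(sample(c("e", "p"), n, replace = TRUE)),
  odor      = factor(sample(c("a", "f", "n"), n, replace = TRUE)),
  gill_size = factor(sample(c("b", "n"), n, replace = TRUE))
)

# One candidate only: Laplace smoothing of 1, no kernel density estimate
grid <- expand.grid(laplace = 1, usekernel = FALSE, adjust = 1)

mod2 <- train(
  x = toy[, c("odor", "gill_size")],
  y = toy$edibility,
  method = "naive_bayes",
  trControl = trainControl(method = "cv", number = 5),
  tuneGrid = grid
)
mod2
```

Because the toy columns are random, the reported accuracy is not meaningful; the sketch only shows the call shape with factor predictors passed through the x/y interface.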
