r - Error when computing a confusion matrix or contingency table for multiclass classification with ranger



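I am using ranger to model a multiclass classification problem on a large mixed data frame (some of the categorical variables have more than 53 levels). Training and testing run without any problems. However, building the confusion matrix/contingency table fails.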

I am using the iris data to illustrate the difficulty, with Species as the categorical response:

library(ranger)
library(caret)
# Data
idx = sample(nrow(iris),100)
data = iris
# Split data sets
Train_Set = data[idx,]
Test_Set = data[-idx,]
# Train
Species.ranger <- ranger(Species ~ ., data = Train_Set, importance = "impurity", save.memory = TRUE, probability = TRUE)
# Test
probabilitiesSpecies <- predict(Species.ranger, data = Test_Set,type='response', verbose = TRUE)
or
probabilitiesSpecies <- as.data.frame(predict(Species.ranger, data = Test_Set,type='response', verbose = TRUE)$predictions)

I run into the following problems:

table(Test_Set$Species, probabilitiesSpecies$predictions)
Error in table(Test_Set$Species, probabilitiesSpecies$predictions) : 
all arguments must have the same length

caret::confusionMatrix(Test_Set$Species, probabilitiesSpecies$predictions)
or
caret::confusionMatrix(table(Test_Set$Species, max.col(probabilitiesSpecies)-1))
gives
Error: `data` and `reference` should be factors with the same levels.

However, the binary classification shown below works:

idx = sample(nrow(iris),100)
data = iris
data$Species = factor(ifelse(data$Species=="virginica",1,0))
Train_Set = data[idx,]
Test_Set = data[-idx,]
# Train
Species.ranger <- ranger(Species ~ ., data = Train_Set, importance = "impurity", save.memory = TRUE, probability = TRUE)
# Test
probabilitiesSpecies <- as.data.frame(predict(Species.ranger, data = Test_Set,type='response', verbose = TRUE)$predictions)
caret::confusionMatrix(table(max.col(probabilitiesSpecies)-1, Test_Set$Species))

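How can I get a confusion matrix for the multiclass case? I have also posted this as a separate thread (Error when computing a multiclass confusion matrix with ranger).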

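From the ranger documentation, for probability = TRUE:

With the probability option and factor dependent variable a probability forest is grown. Here, the node impurity is used for splitting, as in classification forests. Predictions are class probabilities for each sample. In contrast to other implementations, each tree returns a probability estimate and these estimates are averaged for the forest probability estimate. For details see Malley et al. (2012).

I.e., when it is set to TRUE you get probability estimates, which you can then turn into classes using your own threshold. If it is set to FALSE, however, I do not know what the default decision rule is.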
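If you do want to keep probability = TRUE, one way to apply your own rule is to pick the most probable class per row and map the column index back to a class label, so that the predictions become a factor with the same levels as the reference. The sketch below uses base R only and a small hand-made matrix (`probs` and `truth` are hypothetical stand-ins, not ranger output); it assumes the probability matrix has one named column per class, which is how I understand `predict(...)$predictions` to be laid out for a probability forest.

```r
# Hypothetical class-probability matrix: one row per sample, one named column per class.
probs <- matrix(
  c(0.90, 0.05, 0.05,
    0.10, 0.70, 0.20,
    0.05, 0.35, 0.60,
    0.20, 0.30, 0.50),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("setosa", "versicolor", "virginica"))
)

# Hypothetical true labels for the same four samples.
truth <- factor(c("setosa", "versicolor", "virginica", "versicolor"),
                levels = colnames(probs))

# Column index of the largest probability in each row, mapped back to the class name.
# Using the reference levels keeps both factors on the same level set.
pred <- factor(colnames(probs)[max.col(probs, ties.method = "first")],
               levels = levels(truth))

# Contingency table with predictions in rows and reference in columns.
tab <- table(Predicted = pred, Reference = truth)
print(tab)
```

This also suggests why the binary example works while the multiclass one does not: `max.col(...) - 1` produces 0 and 1, which happen to match the recoded levels "0" and "1", whereas in the three-class case it produces 0, 1, 2, which do not match setosa, versicolor and virginica. With real ranger output you would presumably replace `probs` with `predict(Species.ranger, data = Test_Set)$predictions` and `truth` with `Test_Set$Species`, then pass `pred` and `truth` to `caret::confusionMatrix()`.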

In any case, your approach should be as follows:

Species.ranger <- ranger(
Species ~ .,
data = Train_Set,
importance ="impurity",
save.memory = TRUE, 
probability = FALSE
)

You can then evaluate the performance with confusionMatrix as follows:

probabilitiesSpecies <- predict(
Species.ranger,
data = Test_Set,
verbose = TRUE
)
# %>% comes from magrittr, which ranger and caret do not attach
library(magrittr)
table(
probabilitiesSpecies$predictions,
Test_Set$Species
) %>% confusionMatrix()

Confusion Matrix and Statistics

           setosa versicolor virginica
setosa         17          0         0
versicolor      0         16         1
virginica       0          0        16

Overall Statistics

Accuracy : 0.98            
95% CI : (0.8935, 0.9995)
No Information Rate : 0.34            
P-Value [Acc > NIR] : < 2.2e-16       

Kappa : 0.97            

Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                   1.00            1.0000           0.9412
Specificity                   1.00            0.9706           1.0000
Pos Pred Value                1.00            0.9412           1.0000
Neg Pred Value                1.00            1.0000           0.9706
Prevalence                    0.34            0.3200           0.3400
Detection Rate                0.34            0.3200           0.3200
Detection Prevalence          0.34            0.3400           0.3200
Balanced Accuracy             1.00            0.9853           0.9706
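
To see where the per-class figures come from, here is a minimal base R sketch that recomputes a few of them by hand. The table is copied from the output above (rows are predictions, columns are the reference); the variable names are my own, and the formulas are the usual one-vs-rest definitions rather than anything taken from caret's source.

```r
# Confusion table copied from the output above (rows = predicted, columns = reference).
classes <- c("setosa", "versicolor", "virginica")
tab <- matrix(
  c(17,  0,  0,
     0, 16,  1,
     0,  0, 16),
  nrow = 3, byrow = TRUE,
  dimnames = list(Predicted = classes, Reference = classes)
)

n  <- sum(tab)
tp <- diag(tab)                        # correctly predicted samples per class

accuracy    <- sum(tp) / n             # overall share of correct predictions
sensitivity <- tp / colSums(tab)       # TP / all samples that truly belong to the class
ppv         <- tp / rowSums(tab)       # TP / all samples predicted as the class
prevalence  <- colSums(tab) / n        # share of each class in the reference

# True negatives per class: everything outside the class's row and column.
tn <- n - rowSums(tab) - colSums(tab) + tp
specificity <- tn / (n - colSums(tab)) # TN / all samples that are not in the class

print(round(accuracy, 2))
print(round(rbind(sensitivity, specificity, ppv, prevalence), 4))
```

The single misclassified sample (a virginica predicted as versicolor) is what lowers the virginica sensitivity and the versicolor specificity and positive predictive value.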
