Error: BoxCox error during preProcess imputation in R



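I am working through the solution to Exercise 3 of Chapter 6 in Max Kuhn's Applied Predictive Modeling, and I hit an error at the predict step of the imputation (even though I followed their solution exactly). Reproducible code and the problem are below: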

library(AppliedPredictiveModeling)
library(caret)
library(RANN)
data(ChemicalManufacturingProcess)
predictors <- subset(ChemicalManufacturingProcess,select= -Yield)
yield <- subset(ChemicalManufacturingProcess,select="Yield")
# Impute
#Split data into training and test sets
set.seed(517)
trainingRows <- createDataPartition(yield$Yield,
                                    p = 0.7,
                                    list = FALSE)

trainPredictors <- predictors[trainingRows,]
trainYield <- yield[trainingRows,]
testPredictors <- predictors[-trainingRows,]
testYield <- yield[-trainingRows,]

#Pre-process trainPredictors and apply to trainPredictors and testPredictors
pp <- preProcess(trainPredictors,method=c("BoxCox","center","scale","knnImpute"))
ppTrainPredictors <- predict(pp,newdata=trainPredictors)
ppTestPredictors <- predict(pp,newdata=testPredictors) # This results in an error

The error it gives is: Error in RANN::nn2(old[, non_missing_cols, drop = FALSE], new[, non_missing_cols, : NA/NaN/Inf in foreign function call (arg 2)

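When I use the YeoJohnson transformation instead, it seems to work (I read that it can handle non-positive values).

However, I don't understand why it fails on the test data, given that it is just a different subset of the same data as the training set. And is it only the imputation step that has the problem?

I can't find any answers about this, which seems odd. Surely others who have worked through the book would have noticed? Or am I just being thick?

Thanks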

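You get this error because the Box-Cox transformation does not accept zeros (or negative values). If you look at the help page for BoxCoxTrans, it says:

If any(y <= 0) or if length(unique(y)) < numUnique, lambda is not estimated and no transformation is applied.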

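So if preProcess() is fitted on a training set whose column contains no zeros, the Box-Cox transformation is estimated and applied for that column, but it will then not work on a test set that does contain zeros in that column.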
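To see why a zero is a problem, here is a minimal base-R sketch of the textbook Box-Cox formula. The helper name box_cox and the lambda values are made up for illustration; this is not caret's internal code.

```r
# Hypothetical helper: textbook Box-Cox formula, not caret's implementation.
box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

y_test <- c(0, 1, 2.5)          # a test column that happens to contain a zero

box_cox(y_test, lambda = 0)     # first element is -Inf because log(0) = -Inf
box_cox(y_test, lambda = -0.5)  # first element is -Inf too: 0^(-0.5) = Inf, divided by a negative lambda
box_cox(y_test, lambda = 0.5)   # all finite; in this formula a zero only breaks lambda <= 0
```

A non-finite value such as -Inf in the transformed test set is the kind of input that would trigger an "NA/NaN/Inf in foreign function call" error once it reaches the nearest-neighbour search.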

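In the book's example, the seed was most likely set under an older version of R, so it worked there. R 3.6.0 changed the default algorithm behind sample(), so the same seed can produce a different train/test split on a newer R, and then it no longer works. Checking with your example: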

cbind(colSums(trainPredictors==0,na.rm=TRUE),colSums(testPredictors==0,na.rm=TRUE)) 
                       [,1] [,2]
BiologicalMaterial01      0    0
BiologicalMaterial02      0    0
BiologicalMaterial03      0    0
BiologicalMaterial04      0    0
BiologicalMaterial05      0    0
BiologicalMaterial06      0    0
BiologicalMaterial07      0    0
BiologicalMaterial08      0    0
BiologicalMaterial09      0    0
BiologicalMaterial10      0    0
BiologicalMaterial11      0    0
BiologicalMaterial12      0    0
ManufacturingProcess01    1    2
ManufacturingProcess02   29    6
ManufacturingProcess03    0    0
ManufacturingProcess04    0    0
ManufacturingProcess05    0    0
ManufacturingProcess06    0    0
ManufacturingProcess07    0    0
ManufacturingProcess08    0    0
ManufacturingProcess09    0    0
ManufacturingProcess10    0    0
ManufacturingProcess11    0    0
ManufacturingProcess12  104   38
ManufacturingProcess13    0    0
ManufacturingProcess14    0    0
ManufacturingProcess15    0    0
ManufacturingProcess16    1    0
ManufacturingProcess17    0    0
ManufacturingProcess18    1    0

You can see that ManufacturingProcess16 and ManufacturingProcess18 will give you problems.
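Rather than reading the table by eye, the comparison can be automated. The sketch below uses two small made-up data frames (train and test, standing in for trainPredictors and testPredictors) and flags columns that have no non-positive values in the training set but at least one in the test set, which is the situation described above.

```r
# Toy data frames standing in for trainPredictors / testPredictors (values are invented).
train <- data.frame(a = c(1, 2, 3), b = c(0, 4, 5), c = c(2, NA, 6))
test  <- data.frame(a = c(0, 2),    b = c(3, 1),    c = c(5, 7))

# Count non-positive entries per column, ignoring missing values.
n_bad_train <- colSums(train <= 0, na.rm = TRUE)
n_bad_test  <- colSums(test  <= 0, na.rm = TRUE)

# Columns that look safe for Box-Cox on the training set
# but contain a zero or negative value in the test set.
risky <- names(which(n_bad_train == 0 & n_bad_test > 0))
risky
```

With the real data, replacing train and test by trainPredictors and testPredictors should list the columns worth inspecting for a given seed.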

The Yeo-Johnson transformation can handle zeros and negative values, so this is not a problem there.
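For comparison, here is a minimal base-R sketch of the textbook Yeo-Johnson formula. The helper name yeo_johnson is hypothetical and this is not caret's implementation; it assumes the input has no missing values.

```r
# Hypothetical helper: textbook Yeo-Johnson formula, not caret's implementation.
# Assumes y contains no NA values.
yeo_johnson <- function(y, lambda) {
  out <- numeric(length(y))
  pos <- y >= 0
  # Non-negative values: Box-Cox applied to y + 1.
  out[pos] <- if (lambda == 0) log(y[pos] + 1) else ((y[pos] + 1)^lambda - 1) / lambda
  # Negative values: mirrored Box-Cox with parameter 2 - lambda applied to -y + 1.
  out[!pos] <- if (lambda == 2) -log(-y[!pos] + 1) else
    -((-y[!pos] + 1)^(2 - lambda) - 1) / (2 - lambda)
  out
}

yeo_johnson(c(-2, 0, 3), lambda = 0)    # finite for every input
yeo_johnson(c(-2, 0, 3), lambda = 0.5)  # finite for every input
```

Because the formula shifts the argument by one before taking logs or powers, zeros and negative values never reach log(0) or a negative power of zero.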

If you want to continue with this worked example, you can try another seed:

set.seed(517)  # NOTE: 517 is the seed from the question; substitute a different value here
trainingRows <- createDataPartition(yield$Yield,
                                    p = 0.7,
                                    list = FALSE)

trainPredictors <- predictors[trainingRows,]
trainYield <- yield[trainingRows,]
testPredictors <- predictors[-trainingRows,]
testYield <- yield[-trainingRows,]
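Another option, instead of hunting for a seed, is to ask R for its older sampling algorithm. The sketch below assumes R >= 3.6.0, where set.seed() accepts a sample.kind argument; whether this reproduces the book's exact split is not guaranteed, since that may also depend on the caret version.

```r
# Assumes R >= 3.6.0, where the sample.kind argument exists.
# "Rounding" is the pre-3.6.0 sampler; R emits a warning when it is selected.
suppressWarnings(set.seed(517, sample.kind = "Rounding"))
old_style <- sample(10, 3)

# Switch back to the current default sampler afterwards.
set.seed(517, sample.kind = "Rejection")
new_style <- sample(10, 3)

old_style  # draw under the old sampler
new_style  # draw under the current default sampler
```

In the worked example, the set.seed(517, sample.kind = "Rounding") call would go directly before createDataPartition().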
