如何使用矩阵列的均值作为R线性回归中的预测值?

问题陈述:data(gasoline, package="pls").可以得到60个汽油样品的近红外光谱和相应的辛烷值，计算每个频率的平均值，并预测使用问题4中五种不同方法的最佳模型的响应。

注意:这是Julian Faraway在Linear Models with R，第2版中的练习11.5。还有，问题4中的"五种不同的方法"分别是:带所有预测因子的线性回归、使用AIC选择变量的线性回归、主成分回归、偏最小二乘法和岭回归。

My Work So Far:我们做

require(pls)
data(gasoline, package="pls")
test_index = seq(1,nrow(gasoline),10)
train_index = 1:nrow(gasoline)
train_index = train_index[!train_index %in% test_index]
train_gas = gasoline[train_index,]
test_gas = gasoline[test_index,]
lmod = lm(octane~NIR,train_gas)

到目前为止，一切顺利。然而，如果我看一下模型的摘要，我发现由于奇点，348个系数没有定义。(为什么?)此外，将NIR矩阵(预测器)列的平均值转换为可接受的数据帧被证明是困难的。

我的问题:我怎么才能达到高度挑剔的predict函数让我做这样的事情:

new_data = apply(train_gas$NIR, 2, mean)
*some code here*
predict(lmod, new_data)

顺便说一句，因为我在统计上做了大量的调节。SE，我可以肯定地说，这个问题将在Stats上关闭。SE跑题了。这是一个"编程或数据请求"，因此在Stats.SE上不受欢迎。

我也查了一些关于SO的相关问题，但似乎没有一个是完全合适的。

对我来说，这似乎是相当交叉验证的…gasoline是一个相当奇怪的对象，包含一个"列"(元素)，它是一个401列矩阵:

data.frame':    60 obs. of  2 variables:
$ octane: num  85.3 85.2 88.5 83.4 87.9 ...
$ NIR   : 'AsIs' num [1:60, 1:401] -0.0502 -0.0442 -0.0469 -0.0467 -0.0509 ...

然而，最根本的问题是这是一个p>>n问题;有60项观察和401项预测。因此，标准的线性回归可能没有意义——你可能想使用像LASSO/ridge(即glmnet)这样的惩罚方法。这就是为什么你得到未定义的系数(没有某种惩罚，你不能估计402个系数(ncols + 1的截距)从60个观察…)

然而，如果我们真的想把它变成一个形状，我们可以做线性模型和预测(尽管不明智):

NIR <- gasoline$NIR
class(NIR) <- "matrix" ## override "AsIs" class
g2 <- data.frame(octane = gasoline$octane, NIR)
dim(g2) ## 60 402 - now this is a 'regular' data frame

## using train_index from above
train_gas <- g2[train_index,]
lmod = lm(octane~., train_gas)
## drop first column (response); use `lapply()` to maintain list structure
new_data <- as.data.frame(lapply(train_gas[-1], mean))
predict(lmod, new_data)
##        1 
## 87.16019 
## Warning message:
## In predict.lm(lmod, new_data) :
##   prediction from a rank-deficient fit may be misleading

略有更直接的方法(但仍然很难看)是将模型拟合到原始的怪异结构，并构建一个与该怪异结构匹配的预测框架，即

pp <- data.frame(NIR=I(matrix(colMeans(train_gas$NIR), nrow = 1)))

如果你愿意放弃predict()，你可以这样做:

sum(na.omit(coef(lmod) * c(1, colMeans(train_gas$NIR))))

相关内容

最新更新

热门标签：