我有一个数据集,它被建模为具有混合效应的零膨胀负二项式。我想从模型预测中获取置信区间,并绘制模型均值和置信区间。我试图绘制模型均值。有人可以让我知道这是否是正确的方法吗?我不知道如何将模型的置信区间绘制到 ggplot2 上。我想绘制数据的预测平均值及其置信区间。我在下面的代码中对图形进行了基本尝试。
library(pscl)
library(lmtest)
df <- data.frame(
fertilizer = c("N","N","N","N","N","N","N","N","N","N","N","N","P","P","P","P","P","P","P","P","P","P","P","P","N","N","N","N","N","N","N","N","N","N","N","N","P","P","P","P","P","P","P","P","P","P","P","P"),
level = c("low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","low"),
repro = c(0,90,2,4,0,80,1,90,2,33,56,0,99,100,66,80,1,0,2,33,0,0,1,2,90,5,2,2,5,8,0,1,90,2,4,66,0,0,0,0,1,2,90,5,2,5,8,55)
)
model <- formula(repro ∼ fertilizer + level | fertilizer * level)
modelzinb <- zeroinfl(model, dist = "negbin", link = "logit",data =df)
summary(modelzinb)
df$predict <- predict(modelzinb)
ggplot(df, aes(x=fertilizer, y=predict, color = fertilizer)) + theme_bw() + stat_summary(aes(color = fertilizer),fun.y = mean, geom = "point", size = 4, position = position_dodge(0.1)) +
scale_x_discrete("Fertlizer") +
facet_wrap(.~level)
我不确定你想画什么。但我可以向您展示如何计算零膨胀负二项式模型的预测置信区间。
请注意,我无法评论拟合的质量,也无法评论将此类模型拟合到您的数据是否有意义。要回答这类问题,需要我所没有的特定领域知识。
让我们首先将模型拟合到您的数据。
library(pscl) model <- formula(repro ~ fertilizer + level | fertilizer * level) modelzinb <- zeroinfl(model, dist = "negbin", link = "logit", data = df)
我们可以使用
predict
来获取响应的模型估计值。resp <- predict(modelzinb)
请注意,在零膨胀的 NB 模型中,
predict.zeroinfl
(默认情况下(返回观测计数repro
的尺度上的估计平均响应。关于置信区间(CI(,这里的"问题"是
precict.zeroinfl
没有允许直接计算置信区间的interval
参数。附带说明:似乎pscl::zeroinfl
曾经包含该功能,请参阅版本 0.54 的文档。也许值得就此联系软件包维护者。最重要的是,我们必须找到一种不同的方法来计算预测置信区间。为此,我们使用引导。R库
boot
提供了所有必要的功能和工具来做到这一点。首先,我们可以使用函数
boot::boot
来生成预测响应的自举复制。boot::boot
需要一个基于数据生成感兴趣数量(此处为预测响应(的函数。所以我们首先定义这样一个函数stat
。在这种特殊情况下,stat
必须采用两个参数:第一个参数是原始数据,第二个参数是行索引的向量(定义数据的随机引导样本(。然后,该函数将使用自举数据来拟合模型,然后使用该模型根据完整数据预测平均响应。stat <- function(df, inds) { model <- formula(repro ~ fertilizer + level | fertilizer * level) predict( zeroinfl(model, dist = "negbin", link = "logit", data = df[inds, ]), newdata = df) }
为了确保结果的可重复性,我们设置了一个固定的随机种子,并为估计的平均响应生成了 100 个自举样本
set.seed(2018) library(boot) res <- boot(df, stat, R = 100)
输出对象
res
是一个list
,其中包含元素t0
中完整数据的估计平均响应(用all.equal(res$t0, predict(modelzinb))
验证(和元素t
中自举样本的估计平均响应(这是维度R x nrow(df)
的矩阵,其中R
是自举样本的数量(。剩下要做的就是根据拟合到自举样本的模型,根据估计的平均响应计算置信区间。为此,我们可以使用方便的功能
boot::boot.ci
.该函数允许计算不同类型的置信(基本、正常、BCa 等(。在这里,我们使用"basic"
进行演示。我不认为这是最好的选择。boot::boot.ci
采用一个index
参数,该参数对应于估计平均响应向量的输入。实际间隔存储在作为boot.ci
返回对象的list
元素存储的矩阵的最后 2 列(第 4 列和第 5 列(中。CI <- setNames(as.data.frame(t(sapply(1:nrow(df), function(row) boot.ci(res, conf = 0.95, type = "basic", index = row)$basic[, 4:5]))), c("lower", "upper"))
现在我们已经完成了,我们可以将置信区间与原始数据(包括预测的平均响应值(进行列绑定。
df_all <- cbind.data.frame(df, response = predict(modelzinb), CI) # fertilizer level repro response lower upper #1 N low 0 29.72057 5.876731 46.48165 #2 N low 90 29.72057 5.876731 46.48165 #3 N high 2 21.99345 -15.228956 38.86421 #4 N high 4 21.99345 -15.228956 38.86421 #5 N low 0 29.72057 5.876731 46.48165 #6 N low 80 29.72057 5.876731 46.48165 #7 N high 1 21.99345 -15.228956 38.86421 #8 N high 90 21.99345 -15.228956 38.86421 #9 N low 2 29.72057 5.876731 46.48165 #10 N low 33 29.72057 5.876731 46.48165 #11 N high 56 21.99345 -15.228956 38.86421 #12 N high 0 21.99345 -15.228956 38.86421 #13 P low 99 24.22626 -9.225827 46.17656 #14 P low 100 24.22626 -9.225827 46.17656 #15 P high 66 22.71826 2.595246 39.88333 #16 P high 80 22.71826 2.595246 39.88333 #17 P low 1 24.22626 -9.225827 46.17656 #18 P low 0 24.22626 -9.225827 46.17656 #19 P high 2 22.71826 2.595246 39.88333 #20 P high 33 22.71826 2.595246 39.88333 #21 P low 0 24.22626 -9.225827 46.17656 #22 P low 0 24.22626 -9.225827 46.17656 #23 P high 1 22.71826 2.595246 39.88333 #24 P high 2 22.71826 2.595246 39.88333 #25 N low 90 29.72057 5.876731 46.48165 #26 N low 5 29.72057 5.876731 46.48165 #27 N high 2 21.99345 -15.228956 38.86421 #28 N high 2 21.99345 -15.228956 38.86421 #29 N low 5 29.72057 5.876731 46.48165 #30 N low 8 29.72057 5.876731 46.48165 #31 N high 0 21.99345 -15.228956 38.86421 #32 N high 1 21.99345 -15.228956 38.86421 #33 N low 90 29.72057 5.876731 46.48165 #34 N low 2 29.72057 5.876731 46.48165 #35 N high 4 21.99345 -15.228956 38.86421 #36 N high 66 21.99345 -15.228956 38.86421 #37 P low 0 24.22626 -9.225827 46.17656 #38 P low 0 24.22626 -9.225827 46.17656 #39 P high 0 22.71826 2.595246 39.88333 #40 P high 0 22.71826 2.595246 39.88333 #41 P low 1 24.22626 -9.225827 46.17656 #42 P low 2 24.22626 -9.225827 46.17656 #43 P high 90 22.71826 2.595246 39.88333 #44 P high 5 22.71826 2.595246 39.88333 #45 P low 2 24.22626 -9.225827 46.17656 #46 P low 5 24.22626 -9.225827 46.17656 #47 P high 8 22.71826 2.595246 39.88333 #48 P low 55 24.22626 -9.225827 46.17656 #...
示例数据
df <- data.frame(
fertilizer = c("N","N","N","N","N","N","N","N","N","N","N","N","P","P","P","P","P","P","P","P","P","P","P","P","N","N","N","N","N","N","N","N","N","N","N","N","P","P","P","P","P","P","P","P","P","P","P","P"),
level = c("low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","low"),
repro = c(0,90,2,4,0,80,1,90,2,33,56,0,99,100,66,80,1,0,2,33,0,0,1,2,90,5,2,2,5,8,0,1,90,2,4,66,0,0,0,0,1,2,90,5,2,5,8,55))