如何在 ggplot2 R 中为零膨胀负二项式函数添加置信区间?



我有一个数据集,它被建模为具有混合效应的零膨胀负二项式。我想从模型预测中获取置信区间,并绘制模型均值和置信区间。我试图绘制模型均值。有人可以让我知道这是否是正确的方法吗?我不知道如何将模型的置信区间绘制到 ggplot2 上。我想绘制数据的预测平均值及其置信区间。我在下面的代码中对图形进行了基本尝试。

library(pscl)
library(lmtest)
df <- data.frame(
fertilizer = c("N","N","N","N","N","N","N","N","N","N","N","N","P","P","P","P","P","P","P","P","P","P","P","P","N","N","N","N","N","N","N","N","N","N","N","N","P","P","P","P","P","P","P","P","P","P","P","P"), 
level = c("low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","low"), 
repro = c(0,90,2,4,0,80,1,90,2,33,56,0,99,100,66,80,1,0,2,33,0,0,1,2,90,5,2,2,5,8,0,1,90,2,4,66,0,0,0,0,1,2,90,5,2,5,8,55)
)    
model <- formula(repro ∼ fertilizer + level | fertilizer * level)
modelzinb <- zeroinfl(model, dist = "negbin", link = "logit",data =df)
summary(modelzinb)
df$predict <- predict(modelzinb)
ggplot(df, aes(x=fertilizer, y=predict, color = fertilizer)) + theme_bw() +  stat_summary(aes(color = fertilizer),fun.y = mean, geom = "point", size = 4, position = position_dodge(0.1)) +
scale_x_discrete("Fertlizer") +
facet_wrap(.~level)  

我不确定你想画什么。但我可以向您展示如何计算零膨胀负二项式模型的预测置信区间。

请注意,我无法评论拟合的质量,也无法评论将此类模型拟合到您的数据是否有意义。要回答这类问题,需要我所没有的特定领域知识。

  1. 让我们首先将模型拟合到您的数据。

    library(pscl)
    model <- formula(repro ~ fertilizer + level | fertilizer * level)
    modelzinb <- zeroinfl(model, dist = "negbin", link = "logit", data = df)
    
  2. 我们可以使用predict来获取响应的模型估计值。

    resp <- predict(modelzinb)
    

    请注意,在零膨胀的 NB 模型中,predict.zeroinfl(默认情况下(返回观测计数repro的尺度上的估计平均响应。

    关于置信区间(CI(,这里的"问题"是precict.zeroinfl没有允许直接计算置信区间的interval参数。附带说明:似乎pscl::zeroinfl曾经包含该功能,请参阅版本 0.54 的文档。也许值得就此联系软件包维护者。

    最重要的是,我们必须找到一种不同的方法来计算预测置信区间。为此,我们使用引导。R库boot提供了所有必要的功能和工具来做到这一点。

  3. 首先,我们可以使用函数boot::boot来生成预测响应的自举复制。boot::boot需要一个基于数据生成感兴趣数量(此处为预测响应(的函数。所以我们首先定义这样一个函数stat。在这种特殊情况下,stat必须采用两个参数:第一个参数是原始数据,第二个参数是行索引的向量(定义数据的随机引导样本(。然后,该函数将使用自举数据来拟合模型,然后使用该模型根据完整数据预测平均响应。

    stat <- function(df, inds) {
    model <- formula(repro ~ fertilizer + level | fertilizer * level)
    predict(
    zeroinfl(model, dist = "negbin", link = "logit", data = df[inds, ]),
    newdata = df)
    }
    

    为了确保结果的可重复性,我们设置了一个固定的随机种子,并为估计的平均响应生成了 100 个自举样本

    set.seed(2018)
    library(boot)
    res <- boot(df, stat, R = 100)
    

    输出对象res是一个list,其中包含元素t0中完整数据的估计平均响应(用all.equal(res$t0, predict(modelzinb))验证(和元素t中自举样本的估计平均响应(这是维度R x nrow(df)的矩阵,其中R是自举样本的数量(。

  4. 剩下要做的就是根据拟合到自举样本的模型,根据估计的平均响应计算置信区间。为此,我们可以使用方便的功能boot::boot.ci.该函数允许计算不同类型的置信(基本、正常、BCa 等(。在这里,我们使用"basic"进行演示。我不认为这是最好的选择。

    boot::boot.ci采用一个index参数,该参数对应于估计平均响应向量的输入。实际间隔存储在作为boot.ci返回对象的list元素存储的矩阵的最后 2 列(第 4 列和第 5 列(中。

    CI <- setNames(as.data.frame(t(sapply(1:nrow(df), function(row)
    boot.ci(res, conf = 0.95, type = "basic", index = row)$basic[, 4:5]))),
    c("lower", "upper"))
    
  5. 现在我们已经完成了,我们可以将置信区间与原始数据(包括预测的平均响应值(进行列绑定。

    df_all <- cbind.data.frame(df, response = predict(modelzinb), CI)
    #   fertilizer level repro response      lower    upper
    #1           N   low     0 29.72057   5.876731 46.48165
    #2           N   low    90 29.72057   5.876731 46.48165
    #3           N  high     2 21.99345 -15.228956 38.86421
    #4           N  high     4 21.99345 -15.228956 38.86421
    #5           N   low     0 29.72057   5.876731 46.48165
    #6           N   low    80 29.72057   5.876731 46.48165
    #7           N  high     1 21.99345 -15.228956 38.86421
    #8           N  high    90 21.99345 -15.228956 38.86421
    #9           N   low     2 29.72057   5.876731 46.48165
    #10          N   low    33 29.72057   5.876731 46.48165
    #11          N  high    56 21.99345 -15.228956 38.86421
    #12          N  high     0 21.99345 -15.228956 38.86421
    #13          P   low    99 24.22626  -9.225827 46.17656
    #14          P   low   100 24.22626  -9.225827 46.17656
    #15          P  high    66 22.71826   2.595246 39.88333
    #16          P  high    80 22.71826   2.595246 39.88333
    #17          P   low     1 24.22626  -9.225827 46.17656
    #18          P   low     0 24.22626  -9.225827 46.17656
    #19          P  high     2 22.71826   2.595246 39.88333
    #20          P  high    33 22.71826   2.595246 39.88333
    #21          P   low     0 24.22626  -9.225827 46.17656
    #22          P   low     0 24.22626  -9.225827 46.17656
    #23          P  high     1 22.71826   2.595246 39.88333
    #24          P  high     2 22.71826   2.595246 39.88333
    #25          N   low    90 29.72057   5.876731 46.48165
    #26          N   low     5 29.72057   5.876731 46.48165
    #27          N  high     2 21.99345 -15.228956 38.86421
    #28          N  high     2 21.99345 -15.228956 38.86421
    #29          N   low     5 29.72057   5.876731 46.48165
    #30          N   low     8 29.72057   5.876731 46.48165
    #31          N  high     0 21.99345 -15.228956 38.86421
    #32          N  high     1 21.99345 -15.228956 38.86421
    #33          N   low    90 29.72057   5.876731 46.48165
    #34          N   low     2 29.72057   5.876731 46.48165
    #35          N  high     4 21.99345 -15.228956 38.86421
    #36          N  high    66 21.99345 -15.228956 38.86421
    #37          P   low     0 24.22626  -9.225827 46.17656
    #38          P   low     0 24.22626  -9.225827 46.17656
    #39          P  high     0 22.71826   2.595246 39.88333
    #40          P  high     0 22.71826   2.595246 39.88333
    #41          P   low     1 24.22626  -9.225827 46.17656
    #42          P   low     2 24.22626  -9.225827 46.17656
    #43          P  high    90 22.71826   2.595246 39.88333
    #44          P  high     5 22.71826   2.595246 39.88333
    #45          P   low     2 24.22626  -9.225827 46.17656
    #46          P   low     5 24.22626  -9.225827 46.17656
    #47          P  high     8 22.71826   2.595246 39.88333
    #48          P   low    55 24.22626  -9.225827 46.17656
    #...
    

示例数据

df <- data.frame(
fertilizer = c("N","N","N","N","N","N","N","N","N","N","N","N","P","P","P","P","P","P","P","P","P","P","P","P","N","N","N","N","N","N","N","N","N","N","N","N","P","P","P","P","P","P","P","P","P","P","P","P"),
level = c("low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","low"),
repro = c(0,90,2,4,0,80,1,90,2,33,56,0,99,100,66,80,1,0,2,33,0,0,1,2,90,5,2,2,5,8,0,1,90,2,4,66,0,0,0,0,1,2,90,5,2,5,8,55))

最新更新