我在R中的survival
包中遇到了奇怪的行为,在我直接联系包维护者之前,我希望比我聪明的人能告诉我这是不是一个bug,而是一些统计上更微妙的东西(因此我在这里发帖)。
上下文是,我想拟合一个基于时间层的时变系数的Cox模型,我还想要健壮的标准误差来解释聚类观察。令我惊讶的是,输出根据formula
参数中项的顺序而变化,并且它似乎选择性地忽略了相互作用,这取决于它们是在集群参数之前还是之后。
请参阅下面的代表,包括我的评论,指出了我看起来奇怪的事情。由于
附加问题:您知道为什么在expected_fit
中,age
在strata(tgroup)
之前出现,而ph.ecog
在strata(tgroup)
之后出现吗?library(survival);
library(tidyverse);
# Create a fake cluster in the lung data
lung_expanded <-
lung %>%
mutate(fake_id = sample(1:3, nrow(lung), replace = T))
# Model below is as expected: I get the expected time-stratified hazard
# ratios and robust standard errors
expected_fit <-
survSplit(Surv(time, status) ~ .,
data = lung_expanded,
cut = c(200, 300),
episode = "tgroup") %>%
coxph(formula = Surv(tstart, time, status) ~
# putting cluster prior to age:time interaction
# gives expected results
cluster(fake_id) + age:strata(tgroup) + ph.ecog:strata(tgroup))
expected_fit
#> Call:
#> coxph(formula = Surv(tstart, time, status) ~ age:strata(tgroup) +
#> strata(tgroup):ph.ecog, data = ., cluster = fake_id)
#>
#> coef exp(coef) se(coef) robust se z
#> age:strata(tgroup)tgroup=1 0.007912 1.007943 0.013858 0.010575 0.748
#> age:strata(tgroup)tgroup=2 -0.003444 0.996562 0.021995 0.014347 -0.240
#> age:strata(tgroup)tgroup=3 0.020987 1.021209 0.015680 0.006119 3.430
#> strata(tgroup)tgroup=1:ph.ecog 0.665031 1.944551 0.176550 0.177987 3.736
#> strata(tgroup)tgroup=2:ph.ecog 0.650765 1.917007 0.278808 0.032228 20.193
#> strata(tgroup)tgroup=3:ph.ecog 0.097243 1.102129 0.188521 0.131935 0.737
#> p
#> age:strata(tgroup)tgroup=1 0.454370
#> age:strata(tgroup)tgroup=2 0.810300
#> age:strata(tgroup)tgroup=3 0.000604
#> strata(tgroup)tgroup=1:ph.ecog 0.000187
#> strata(tgroup)tgroup=2:ph.ecog < 2e-16
#> strata(tgroup)tgroup=3:ph.ecog 0.461090
#>
#> Likelihood ratio test=24.9 on 6 df, p=0.0003561
#> n= 462, number of events= 164
#> (1 observation deleted due to missingness)
# Model output below differs from above, but it varies only in the ordering of the
# formula. I only get the hazard ratios corresponding to age but not
# those corresponding ecog status. moreover, I get an unexpected hazard
# ratio corresponding to a newly created variable called 'cluster(fake_id)'
suspicious_fit <-
survSplit(Surv(time, status) ~ .,
data = lung_expanded,
cut = c(200, 300),
episode = "tgroup") %>%
coxph(formula = Surv(tstart, time, status) ~
# putting age:time interaction prior to the cluster
# gives suspicious results
age:strata(tgroup) + cluster(fake_id) + ph.ecog:strata(tgroup))
suspicious_fit
#> Call:
#> coxph(formula = Surv(tstart, time, status) ~ cluster(fake_id) +
#> age:strata(tgroup), data = ., cluster = fake_id)
#>
#> coef exp(coef) se(coef) robust se z p
#> cluster(fake_id) -0.04020 0.96060 0.09952 0.05094 -0.789 0.43004
#> age:strata(tgroup)tgroup=1 0.01928 1.01947 0.01354 0.01477 1.306 0.19166
#> age:strata(tgroup)tgroup=2 0.01029 1.01034 0.02149 0.01782 0.578 0.56359
#> age:strata(tgroup)tgroup=3 0.02254 1.02279 0.01556 0.00655 3.441 0.00058
#>
#> Likelihood ratio test=4.62 on 4 df, p=0.3288
#> n= 463, number of events= 165
# In the model below, which has no time-varying strata,
# everything looks good.
expected_fit2 <-
lung_expanded %>%
coxph(formula = Surv(time, status) ~
# putting age:time interaction prior to the cluster
# gives suspicious results
age + cluster(fake_id) + ph.ecog)
expected_fit2
#> Call:
#> coxph(formula = Surv(time, status) ~ age + ph.ecog, data = .,
#> cluster = fake_id)
#>
#> coef exp(coef) se(coef) robust se z p
#> age 0.011281 1.011345 0.009319 0.006714 1.68 0.0929
#> ph.ecog 0.443485 1.558128 0.115831 0.028093 15.79 <2e-16
#>
#> Likelihood ratio test=19.06 on 2 df, p=7.279e-05
#> n= 227, number of events= 164
#> (1 observation deleted due to missingness)
sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Catalina 10.15.7
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.6 purrr_0.3.4
#> [5] readr_1.4.0 tidyr_1.1.3 tibble_3.1.2 ggplot2_3.3.3
#> [9] tidyverse_1.3.1 survival_3.2-11
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.1.1 xfun_0.23 splines_4.1.0 haven_2.4.1
#> [5] lattice_0.20-44 colorspace_2.0-1 vctrs_0.3.8 generics_0.1.0
#> [9] htmltools_0.5.1.1 yaml_2.2.1 utf8_1.2.1 rlang_0.4.11
#> [13] pillar_1.6.1 glue_1.4.2 withr_2.4.2 DBI_1.1.1
#> [17] dbplyr_2.1.1 readxl_1.3.1 modelr_0.1.8 lifecycle_1.0.0
#> [21] cellranger_1.1.0 munsell_0.5.0 gtable_0.3.0 rvest_1.0.0
#> [25] evaluate_0.14 knitr_1.33 fansi_0.5.0 highr_0.9
#> [29] Rcpp_1.0.7 broom_0.7.6 backports_1.2.1 scales_1.1.1
#> [33] jsonlite_1.7.2 fs_1.5.0 hms_1.1.0 digest_0.6.27
#> [37] stringi_1.6.2 grid_4.1.0 cli_2.5.0 tools_4.1.0
#> [41] magrittr_2.0.1 crayon_1.4.1 pkgconfig_2.0.3 ellipsis_0.3.2
#> [45] Matrix_1.3-4 xml2_1.3.2 reprex_2.0.0 lubridate_1.7.10
#> [49] assertthat_0.2.1 rmarkdown_2.8 httr_1.4.2 rstudioapi_0.13
#> [53] R6_2.5.0 compiler_4.1.0
Created on 2021-08-16 by the reprex package (v2.0.0)
```
我深入研究了survival::coxph
代码,找出了有问题的代码行,并联系了Terry Therneau(软件包维护者),他证实了这是一个bug。他指出有"……虫子在…以R为底的公式,通过更新继承。公式和重新公式。他已经告诉我3.2-13版本修复了这个问题。