分别为r中的每一列拟合逻辑回归模型



我是r的新手,我需要运行一个循环或使用sapply将逻辑回归模型分别拟合到每列(以"rs"开头(,并使用以下公式:

AD~[co1:cov6]+rs1,然后对每个rs2、rs3等重复

我还需要重复上面的代码,但将公式更改为:

AD~[co1:cov6]+rs1*alcohol_intake,然后对每个rs2、rs3等重复

示例数据:

set.seed(543)
mydata <- data.frame(ID = 1:100, 
sex = sample(1:2, size = 500, replace = TRUE), 
age = runif(100, min= 35, max = 70), 
bmi = runif(100, min= 15, max = 35),
smoker = rep(c('smoker', 'not smoker')),
alcohol_intake = rep(c('regular', 'not regular')),
AD = sample(0:1, size = 500, replace = TRUE),
cov1 = runif(100, min= 0.0000, max = 1.0000),
cov2 = runif(100, min= -3.013e-02, max = 1.818e-02),
cov3 = runif(100, min= -3.562e-02, max = 1.540e-02),
cov4 = runif(100, min= -2.356e-02, max = 1.685e-02),
cov5 = runif(100, min= -1.392e-02, max = 2.894e-02),
cov6 = runif(100, min= -1.896e-02, max = 2.136e-02),
rs1 = sample(0:2, size = 500, replace = TRUE), 
rs2 = sample(0:2, size = 500, replace = TRUE),
rs3 = sample(0:2, size = 500, replace = TRUE),
rs4 = sample(0:2, size = 500, replace = TRUE),
rs5 = sample(0:2, size = 500, replace = TRUE),
rs6 = sample(0:2, size = 500, replace = TRUE),
rs7 = sample(0:2, size = 500, replace = TRUE),
rs8 = sample(0:2, size = 500, replace = TRUE),
rs9 = sample(0:2, size = 500, replace = TRUE),
rs10 = sample(0:2, size = 500, replace = TRUE),
rs11 = sample(0:2, size = 500, replace = TRUE),
rs12 = sample(0:2, size = 500, replace = TRUE),
rs13 = sample(0:2, size = 500, replace = TRUE),
rs14 = sample(0:2, size = 500, replace = TRUE),
rs15 = sample(0:2, size = 500, replace = TRUE),
rs16 = sample(0:2, size = 500, replace = TRUE),
rs17 = sample(0:2, size = 500, replace = TRUE),
rs18 = sample(0:2, size = 500, replace = TRUE),
rs19 = sample(0:2, size = 500, replace = TRUE),
rs20 = sample(0:2, size = 500, replace = TRUE),
rs21 = sample(0:2, size = 500, replace = TRUE),
rs22 = sample(0:2, size = 500, replace = TRUE),
rs23 = sample(0:2, size = 500, replace = TRUE),
rs24 = sample(0:2, size = 500, replace = TRUE),
rs25 = sample(0:2, size = 500, replace = TRUE),
rs26 = sample(0:2, size = 500, replace = TRUE),
rs27 = sample(0:2, size = 500, replace = TRUE),
rs28 = sample(0:2, size = 500, replace = TRUE),
rs29 = sample(0:2, size = 500, replace = TRUE),
rs30 = sample(0:2, size = 500, replace = TRUE)
)

谢谢。

基于我在上一次回答中的评论,我觉得从长远来看,转换为长格式是有益的。。。

转换为长格式很简单:

mydataLong <- mydata %>% 
pivot_longer(
cols=starts_with("rs"), 
names_to="Variable", 
values_to="Value"
)

它给出了一个看起来像这样的数据帧(删除一些列以显示感兴趣的数据帧部分(:

mydataLong %>% select(AD, starts_with("cov"), Variable, Value)
# A tibble: 15,000 × 9
AD  cov1    cov2    cov3   cov4     cov5    cov6 Variable Value
<int> <dbl>   <dbl>   <dbl>  <dbl>    <dbl>   <dbl> <chr>    <int>
1     0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs1          0
2     0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs2          0
3     0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs3          1
4     0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs4          0
5     0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs5          2
6     0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs6          1
7     0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs7          0
8     0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs8          0
9     0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs9          1
10     0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs10         2

为了准确地再现OP请求的分析,我可以使用tidyverse的group_map()函数。CCD_ 2将函数应用于分组的数据帧,并返回其元素是依次应用提供给数据帧的每组的函数的结果的列表。函数应该有两个参数,通常称为.x.y.x包含当前组中的数据。.y包含一行,并且仅包含定义当前组的那些列。

resultList <- mydataLong %>%
group_by(Variable) %>% 
group_map(
function(.x, .y) {
glm(
AD ~ cov1 + cov2 + cov3 + cov4 +cov5 + cov6 + Value, 
data=.x, 
family="binomial"
)
}
)
resultList[[17]]
Call:  glm(formula = AD ~ cov1 + cov2 + cov3 + cov4 + cov5 + cov6 + 
Value, family = "binomial", data = .x)
Coefficients:
(Intercept)         cov1         cov2         cov3         cov4         cov5         cov6        Value  
0.1370       0.0422       3.9611       1.0340     -10.2308       3.5968       3.1808      -0.1629  
Degrees of Freedom: 499 Total (i.e. Null);  492 Residual
Null Deviance:      693.1 
Residual Deviance: 688.5    AIC: 704.5

这种方法的优点是它是稳健的。一旦数据帧是长格式的,相同的代码将正确地用于任何数量的响应变量,而不管它们的名称是什么。";宽格式";解决方案例如,OP指出,在实际用例中,响应变量的名称看起来像rs65387690_A,而不是rs1。这需要对代码进行更改。这在这里没有必要:一个错误源已经被删除,生产代码的维护也减少了。

我认为这里还值得一提的是broom包。broom试图消除各种分析函数返回的对象格式的不一致性:它将返回的对象转换为数据帧,并尝试标准化列名。本质上,它通过定义一个通用函数tidy()来实现这一点。如果broom没有为您的特定风格的结果对象提供tidy()方法,那么编写自己的方法非常容易。

因此,调整我的解决方案以使用broom,并在管道的末尾添加对bind_rows()的调用,我得到了

library(broom)
mydataLong %>% 
group_by(Variable) %>% 
group_map(
function(.x, .y) {
glm(
AD ~ cov1 + cov2 + cov3 + cov4 +cov5 + cov6 + Value, 
data=.x, 
family="binomial"
) %>% 
tidy() %>% 
add_column(Variable=.y$Variable, .before=1)
}
) %>% 
bind_rows()
# A tibble: 240 × 6
Variable term        estimate std.error statistic p.value
<chr>    <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 rs1      (Intercept)  -0.308      0.245   -1.26    0.209 
2 rs1      cov1          0.0277     0.354    0.0784  0.938 
3 rs1      cov2          3.67       6.79     0.540   0.589 
4 rs1      cov3          0.362      6.01     0.0602  0.952 
5 rs1      cov4         -8.92       8.09    -1.10    0.270 
6 rs1      cov5          4.48       7.11     0.630   0.529 
7 rs1      cov6          2.27       8.06     0.282   0.778 
8 rs1      Value         0.263      0.114    2.30    0.0214
9 rs10     (Intercept)  -0.0858     0.248   -0.346   0.729 
10 rs10     cov1          0.0583     0.352    0.166   0.869 
# … with 230 more rows

因此,所有分析的结果都在一个数据帧中,我发现它比单个对象的列表更容易在下游处理。

CCD_ 16可以用在";宽格式";解决方案以及该";长格式";变异

我不知道你说的"[cov1:cov6]";。我假设你是指从cov1cov6的每个协变量。

考虑到你的数据框架的形式,我会做你想要的事情。关键是要知道as.formula()会将字符串转换为公式,因此您可以使用group_map0动态构建必要的公式。

resultList <- lapply(
1:30,
function(x) {
glm(
as.formula(paste0("AD ~ cov1 + cov2 + cov3 + cov4 +cov5 + cov6 + rs", x)), 
data=mydata, 
family="binomial"
)    
}
)

CCD_ 21是30个元素的列表。其xxxx元素是对涉及CCD_ 22的模型进行拟合的结果。例如:

resultList[[10]]
Call:  glm(formula = as.formula(paste0("AD ~ cov1 + cov2 + cov3 + cov4 +cov5 + cov6 + rs", 
x)), family = "binomial", data = mydata)
Coefficients:
(Intercept)         cov1         cov2         cov3         cov4         cov5         cov6         rs10  
0.1876      -0.2297      -2.8596      -5.3283      -1.2817       1.2763      18.5162      -0.1520  
Degrees of Freedom: 499 Total (i.e. Null);  492 Residual
Null Deviance:      693.1 
Residual Deviance: 683.9    AIC: 699.9

我将把它留给你来制定你需要为你的交互模型做什么。

顺便说一句,我还将研究将数据转换为长格式,然后使用tidyverse的group_by()group_map()函数。就我个人而言,我发现这种方法比lapply方法更容易、更健壮、更高效、更易于维护和记录。

相关内容

最新更新