我是r的新手,我需要运行一个循环或使用sapply将逻辑回归模型分别拟合到每列(以"rs"开头(,并使用以下公式:
AD~[co1:cov6]+rs1,然后对每个rs2、rs3等重复
我还需要重复上面的代码,但将公式更改为:
AD~[co1:cov6]+rs1*alcohol_intake,然后对每个rs2、rs3等重复
示例数据:
set.seed(543)
mydata <- data.frame(ID = 1:100,
sex = sample(1:2, size = 500, replace = TRUE),
age = runif(100, min= 35, max = 70),
bmi = runif(100, min= 15, max = 35),
smoker = rep(c('smoker', 'not smoker')),
alcohol_intake = rep(c('regular', 'not regular')),
AD = sample(0:1, size = 500, replace = TRUE),
cov1 = runif(100, min= 0.0000, max = 1.0000),
cov2 = runif(100, min= -3.013e-02, max = 1.818e-02),
cov3 = runif(100, min= -3.562e-02, max = 1.540e-02),
cov4 = runif(100, min= -2.356e-02, max = 1.685e-02),
cov5 = runif(100, min= -1.392e-02, max = 2.894e-02),
cov6 = runif(100, min= -1.896e-02, max = 2.136e-02),
rs1 = sample(0:2, size = 500, replace = TRUE),
rs2 = sample(0:2, size = 500, replace = TRUE),
rs3 = sample(0:2, size = 500, replace = TRUE),
rs4 = sample(0:2, size = 500, replace = TRUE),
rs5 = sample(0:2, size = 500, replace = TRUE),
rs6 = sample(0:2, size = 500, replace = TRUE),
rs7 = sample(0:2, size = 500, replace = TRUE),
rs8 = sample(0:2, size = 500, replace = TRUE),
rs9 = sample(0:2, size = 500, replace = TRUE),
rs10 = sample(0:2, size = 500, replace = TRUE),
rs11 = sample(0:2, size = 500, replace = TRUE),
rs12 = sample(0:2, size = 500, replace = TRUE),
rs13 = sample(0:2, size = 500, replace = TRUE),
rs14 = sample(0:2, size = 500, replace = TRUE),
rs15 = sample(0:2, size = 500, replace = TRUE),
rs16 = sample(0:2, size = 500, replace = TRUE),
rs17 = sample(0:2, size = 500, replace = TRUE),
rs18 = sample(0:2, size = 500, replace = TRUE),
rs19 = sample(0:2, size = 500, replace = TRUE),
rs20 = sample(0:2, size = 500, replace = TRUE),
rs21 = sample(0:2, size = 500, replace = TRUE),
rs22 = sample(0:2, size = 500, replace = TRUE),
rs23 = sample(0:2, size = 500, replace = TRUE),
rs24 = sample(0:2, size = 500, replace = TRUE),
rs25 = sample(0:2, size = 500, replace = TRUE),
rs26 = sample(0:2, size = 500, replace = TRUE),
rs27 = sample(0:2, size = 500, replace = TRUE),
rs28 = sample(0:2, size = 500, replace = TRUE),
rs29 = sample(0:2, size = 500, replace = TRUE),
rs30 = sample(0:2, size = 500, replace = TRUE)
)
谢谢。
基于我在上一次回答中的评论,我觉得从长远来看,转换为长格式是有益的。。。
转换为长格式很简单:
mydataLong <- mydata %>%
pivot_longer(
cols=starts_with("rs"),
names_to="Variable",
values_to="Value"
)
它给出了一个看起来像这样的数据帧(删除一些列以显示感兴趣的数据帧部分(:
mydataLong %>% select(AD, starts_with("cov"), Variable, Value)
# A tibble: 15,000 × 9
AD cov1 cov2 cov3 cov4 cov5 cov6 Variable Value
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <int>
1 0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs1 0
2 0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs2 0
3 0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs3 1
4 0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs4 0
5 0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs5 2
6 0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs6 1
7 0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs7 0
8 0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs8 0
9 0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs9 1
10 0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs10 2
为了准确地再现OP请求的分析,我可以使用tidyverse的group_map()
函数。CCD_ 2将函数应用于分组的数据帧,并返回其元素是依次应用提供给数据帧的每组的函数的结果的列表。函数应该有两个参数,通常称为.x
和.y
。.x
包含当前组中的数据。.y
包含一行,并且仅包含定义当前组的那些列。
resultList <- mydataLong %>%
group_by(Variable) %>%
group_map(
function(.x, .y) {
glm(
AD ~ cov1 + cov2 + cov3 + cov4 +cov5 + cov6 + Value,
data=.x,
family="binomial"
)
}
)
resultList[[17]]
Call: glm(formula = AD ~ cov1 + cov2 + cov3 + cov4 + cov5 + cov6 +
Value, family = "binomial", data = .x)
Coefficients:
(Intercept) cov1 cov2 cov3 cov4 cov5 cov6 Value
0.1370 0.0422 3.9611 1.0340 -10.2308 3.5968 3.1808 -0.1629
Degrees of Freedom: 499 Total (i.e. Null); 492 Residual
Null Deviance: 693.1
Residual Deviance: 688.5 AIC: 704.5
这种方法的优点是它是稳健的。一旦数据帧是长格式的,相同的代码将正确地用于任何数量的响应变量,而不管它们的名称是什么。";宽格式";解决方案例如,OP指出,在实际用例中,响应变量的名称看起来像rs65387690_A
,而不是rs1
。这需要对代码进行更改。这在这里没有必要:一个错误源已经被删除,生产代码的维护也减少了。
我认为这里还值得一提的是broom
包。broom
试图消除各种分析函数返回的对象格式的不一致性:它将返回的对象转换为数据帧,并尝试标准化列名。本质上,它通过定义一个通用函数tidy()
来实现这一点。如果broom
没有为您的特定风格的结果对象提供tidy()
方法,那么编写自己的方法非常容易。
因此,调整我的解决方案以使用broom
,并在管道的末尾添加对bind_rows()
的调用,我得到了
library(broom)
mydataLong %>%
group_by(Variable) %>%
group_map(
function(.x, .y) {
glm(
AD ~ cov1 + cov2 + cov3 + cov4 +cov5 + cov6 + Value,
data=.x,
family="binomial"
) %>%
tidy() %>%
add_column(Variable=.y$Variable, .before=1)
}
) %>%
bind_rows()
# A tibble: 240 × 6
Variable term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 rs1 (Intercept) -0.308 0.245 -1.26 0.209
2 rs1 cov1 0.0277 0.354 0.0784 0.938
3 rs1 cov2 3.67 6.79 0.540 0.589
4 rs1 cov3 0.362 6.01 0.0602 0.952
5 rs1 cov4 -8.92 8.09 -1.10 0.270
6 rs1 cov5 4.48 7.11 0.630 0.529
7 rs1 cov6 2.27 8.06 0.282 0.778
8 rs1 Value 0.263 0.114 2.30 0.0214
9 rs10 (Intercept) -0.0858 0.248 -0.346 0.729
10 rs10 cov1 0.0583 0.352 0.166 0.869
# … with 230 more rows
因此,所有分析的结果都在一个数据帧中,我发现它比单个对象的列表更容易在下游处理。
CCD_ 16可以用在";宽格式";解决方案以及该";长格式";变异
我不知道你说的"[cov1:cov6]";。我假设你是指从cov1
到cov6
的每个协变量。
考虑到你的数据框架的形式,我会做你想要的事情。关键是要知道as.formula()
会将字符串转换为公式,因此您可以使用group_map
0动态构建必要的公式。
resultList <- lapply(
1:30,
function(x) {
glm(
as.formula(paste0("AD ~ cov1 + cov2 + cov3 + cov4 +cov5 + cov6 + rs", x)),
data=mydata,
family="binomial"
)
}
)
CCD_ 21是30个元素的列表。其xxxx元素是对涉及CCD_ 22的模型进行拟合的结果。例如:
resultList[[10]]
Call: glm(formula = as.formula(paste0("AD ~ cov1 + cov2 + cov3 + cov4 +cov5 + cov6 + rs",
x)), family = "binomial", data = mydata)
Coefficients:
(Intercept) cov1 cov2 cov3 cov4 cov5 cov6 rs10
0.1876 -0.2297 -2.8596 -5.3283 -1.2817 1.2763 18.5162 -0.1520
Degrees of Freedom: 499 Total (i.e. Null); 492 Residual
Null Deviance: 693.1
Residual Deviance: 683.9 AIC: 699.9
我将把它留给你来制定你需要为你的交互模型做什么。
顺便说一句,我还将研究将数据转换为长格式,然后使用tidyverse的group_by()
和group_map()
函数。就我个人而言,我发现这种方法比lapply
方法更容易、更健壮、更高效、更易于维护和记录。