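I have a dataset with 200 variables, each of which has some missing values. Each of the 200 variables has a companion column that I want to use to fill in its missing values.
Sample data: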
have <- data.frame(ID = c(1:10), var1 = c(runif(7), NA, NA, NA), var1_fill = runif(10))
ID var1 var1_fill
1 1 0.68783885 0.140508053
2 2 0.74672512 0.001270443
3 3 0.09607276 0.917535359
4 4 0.03222775 0.363960434
5 5 0.03560543 0.901288399
6 6 0.46595122 0.725499220
7 7 0.42781890 0.781295939
8 8 NA 0.737999219
9 9 NA 0.456795266
10 10 NA 0.314562042
To impute a single column, I would use the following code:
have$var1_imputed <- ifelse(is.na(have$var1), have$var1_fill, have$var1)
ID var1 var1_fill var1_imputed
1 1 0.68783885 0.140508053 0.68783885
2 2 0.74672512 0.001270443 0.74672512
3 3 0.09607276 0.917535359 0.09607276
4 4 0.03222775 0.363960434 0.03222775
5 5 0.03560543 0.901288399 0.03560543
6 6 0.46595122 0.725499220 0.46595122
7 7 0.42781890 0.781295939 0.42781890
8 8 NA 0.737999219 0.73799922
9 9 NA 0.456795266 0.45679527
10 10 NA 0.314562042 0.31456204
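I am having trouble working out how to write a loop over the 200 variables, because I cannot use $ to refer to column names programmatically. In the real dataset the variable names do not follow any pattern such as var1, var2, and so on. However, the 200 original variables are in columns 7 to 206, and the corresponding fill columns are in columns 207 to 406. Each fill column has the same name as its original column plus a suffix, as in the example (var1 and var1_fill).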
A data.table option using fcoalesce:
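library(data.table)

# Pair columns 7:206 with columns 207:406 and keep the first non-NA value of each pair.
# The result holds only the new "_imputed" columns; setDT() converts df by reference.
setDT(df)[
  ,
  setNames(
    Map(fcoalesce, .SD[, 7:206], .SD[, 207:406]),
    paste0(names(.SD[, 7:206]), "_imputed")
  )
]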
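Because the snippet above needs a 406-column table to run, here is a minimal sketch of the same fcoalesce idea on a toy table. The table dt, its column names (a, b, a_fill, b_fill) and the helper vectors orig_cols, fill_cols and imputed_cols are all made up for illustration; it selects columns by name and assumes every fill column is named after its original plus "_fill".

```r
library(data.table)

# Hypothetical toy data: two variables and their "_fill" companions
dt <- data.table(
  ID     = 1:5,
  a      = c(0.1, NA, 0.3, NA, 0.5),
  b      = c(NA, 2, 3, 4, NA),
  a_fill = c(9.1, 9.2, 9.3, 9.4, 9.5),
  b_fill = c(8.1, 8.2, 8.3, 8.4, 8.5)
)

orig_cols    <- c("a", "b")                    # columns with missing values
fill_cols    <- paste0(orig_cols, "_fill")     # assumed naming convention
imputed_cols <- paste0(orig_cols, "_imputed")  # names for the new columns

# fcoalesce() keeps the first non-NA value of each element-wise pair
filled <- Map(
  fcoalesce,
  dt[, orig_cols, with = FALSE],
  dt[, fill_cols, with = FALSE]
)

# Add the imputed columns to dt by reference
dt[, (imputed_cols) := filled]
```

Unlike the snippet above, this version appends the new columns to the existing table instead of returning them on their own.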
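You can use a for loop that does the following:
- checks whether each column is numeric
- replaces the NA entries in that column with the column mean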
for (i in seq_len(ncol(df))) {
  # Only numeric columns have a meaningful mean
  if (is.numeric(df[[i]])) {
    df[is.na(df[[i]]), i] <- mean(df[[i]], na.rm = TRUE)
  }
}
Console output:
# ID var1 var1_fill
#1 1 0.01655469 0.5765553
#2 2 0.36868666 0.7912901
#3 3 0.80009094 0.7261624
#4 4 0.81749627 0.1174201
#5 5 0.10803860 0.5773327
#6 6 0.95316825 0.5261833
#7 7 0.34709855 0.2248959
#8 8 0.48730485 0.9822904
#9 9 0.48730485 0.8536809
#10 10 0.48730485 0.8169835
Data
df <- structure(list(ID = 1:10, var1 = c(0.212889024056494, 0.708460660418496,
0.135542315198109, 0.928928294451907, 0.893806081730872, 0.853342124959454,
0.226619977504015, NA, NA, NA), var1_fill = c(0.200933166779578,
0.939760707085952, 0.201024484355003, 0.843706431100145, 0.749990617623553,
0.51712017855607, 0.521659950027242, 0.168859238736331, 0.826423087157309,
0.930347595131025)), class = "data.frame", row.names = c(NA,
-10L))
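The loop above imputes with the column mean. If the goal is instead to fill each variable from its companion column, as the question describes, a base R loop over column positions might look like the sketch below. The toy data frame df_demo and the index vectors orig_idx and fill_idx are made up for illustration; in the real data the positions would be 7:206 and 207:406, and the sketch assumes the fill columns appear in the same order as the originals.

```r
# Hypothetical toy data: two variables followed by their "_fill" companions
df_demo <- data.frame(
  ID     = 1:5,
  a      = c(0.1, NA, 0.3, NA, 0.5),
  b      = c(NA, 2, 3, 4, NA),
  a_fill = c(9.1, 9.2, 9.3, 9.4, 9.5),
  b_fill = c(8.1, 8.2, 8.3, 8.4, 8.5)
)

orig_idx <- 2:3  # would be 7:206 in the real data
fill_idx <- 4:5  # would be 207:406 in the real data

# Guard against misaligned columns: each fill column must be "<original>_fill"
stopifnot(identical(
  names(df_demo)[fill_idx],
  paste0(names(df_demo)[orig_idx], "_fill")
))

for (k in seq_along(orig_idx)) {
  orig <- df_demo[[orig_idx[k]]]
  fill <- df_demo[[fill_idx[k]]]
  new_name <- paste0(names(df_demo)[orig_idx[k]], "_imputed")
  # Use the fill value only where the original is missing
  df_demo[[new_name]] <- ifelse(is.na(orig), fill, orig)
}
```

Indexing with [[ ]] is what replaces $ here: it accepts a position or a name held in a variable, so the same line works for every pair of columns. The new columns are appended at the end, so the positions in orig_idx and fill_idx stay valid throughout the loop.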