r - Counting NAs in a huge data.table



I have a large data.table with several hundred million records. I am trying to remove entries that have missing data in many of the observed variables, and I am using the following code:

DATA[, NR := Reduce("+", data.table(!is.na(.SD))),
     .SDcols = c("var1", "var2", "var3", "var4", "var5", "var6"),
     by = c("ID", "day")]

Although I cannot share the data, I would like to know whether the code above is the fastest way to do this in data.table, or whether I am missing something. It has been running for a long time and does not seem to finish.

Thanks for any tips.

In the original code, is.na(.SD) creates a logical matrix that is then converted to a data.table within each group. Instead, the columns can be looped over with lapply and the resulting logical vectors summed with Reduce and `+`:

DATA[, NR := Reduce(`+`, lapply(.SD, is.na)),
.SDcols = paste0('var', 1:6), by = .(ID, day)]
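
To see what the expression computes, here is a minimal sketch on a small made-up table (the toy values below are purely illustrative): lapply(.SD, is.na) returns one logical vector per column, and Reduce(`+`, ...) adds those vectors element-wise, giving the per-row count of NAs.

library(data.table)
toy <- data.table(ID = c(1, 1, 2), day = c(1, 2, 1),
                  var1 = c(1, NA, 3), var2 = c(NA, NA, 5))
# one logical vector per column, summed element-wise -> per-row NA count
toy[, NR := Reduce(`+`, lapply(.SD, is.na)), .SDcols = c("var1", "var2")]
toy
#    ID day var1 var2 NR
# 1:  1   1    1   NA  1
# 2:  1   2   NA   NA  2
# 3:  2   1    3    5  0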

Benchmarks

library(data.table)
set.seed(24)
DATA <- data.table(ID = rep(1:500, each = 50),
                   day = rep(1:25, length.out = 25000),
                   var1 = sample(c(1:5, NA), 25000, replace = TRUE),
                   var2 = sample(c(1:25, NA), 25000, replace = TRUE),
                   var3 = sample(c(1:35, NA), 25000, replace = TRUE),
                   var4 = sample(c(1:100, NA), 25000, replace = TRUE),
                   var5 = sample(c(1:50, NA), 25000, replace = TRUE),
                   var6 = sample(c(1:10, NA), 25000, replace = TRUE))
DATA1 <- copy(DATA)
system.time(DATA[, NR := Reduce("+", data.table(!is.na(.SD))),
                 .SDcols = c("var1", "var2", "var3", "var4", "var5", "var6"),
                 by = c("ID", "day")])
# user  system elapsed 
# 6.451   0.148   5.535 
system.time(DATA1[, NR := Reduce(`+`, lapply(.SD, is.na)), .SDcols = paste0('var', 1:6), by = .(ID, day)])
# user  system elapsed 
#  1.307   0.046   0.353 

Or, another option is rowSums:

system.time(DATA1[, NR2 := rowSums(is.na(.SD)),
.SDcols = paste0('var', 1:6), by = .(ID, day)])
#   user  system elapsed 
#  1.434   0.028   0.456 
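
Since is.na() and the summation are purely row-wise operations, the by = .(ID, day) grouping does not change the resulting counts here. As a sketch (this variant is not part of the original benchmark, so its relative timing on real data is only an assumption), the grouping could also be dropped entirely:

# per-row NA count computed over the whole table in one call, avoiding
# per-group overhead; NR3 is just an illustrative column name
system.time(DATA1[, NR3 := rowSums(is.na(.SD)), .SDcols = paste0('var', 1:6)])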
