R - data.table aggregation: selecting the first non-NA value



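I want to aggregate by group and apply a different operation to each column. How can I select the first non-NA value in a column? I found a way to do it, but I suspect it can be done more efficiently: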

library(data.table)
test <- data.table(A = c(NA, NA, 1), B = c(1, 2, 3), C = c(NA, NA, 1), D = c(1, 2, 2))
test[, list(A = A[!is.na(A)][1], B = max(B), C = C[!is.na(C)][1]), by = D]

Is there a more efficient way? I have to do this many times on a very large data set.

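If you are after speed, you can always go the C++ route. It is faster, but not dramatically so. I adapted @akrun's code so that it works with the current version: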

# Approach from the question
op <- function(d) d[, list(A = A[!is.na(A)][1], B = max(B), C = C[!is.na(C)][1]), by = D]
# @akrun's approach, adapted
akrun_m <- function(d) d[, c(B = max(B), lapply(mget(c('A', 'C')), function(x) x[!is.na(x)][1])), by = D]
library(Rcpp)
# C++ helper: return the first non-NA element, or NA if there is none
cppFunction('double firstnum(NumericVector vec) {
  int n = vec.size();
  for (int i = 0; i < n; ++i) {
    if (!R_IsNA(vec[i])) {
      return vec[i];
    }
  }
  return NA_REAL;  // every element is NA, or the vector is empty
}
')
seb <- function(d) d[, c(B = max(B), lapply(mget(c('A', 'C')), firstnum)), by = D]
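
For readers who would rather stay in plain R, the same "first non-NA per group" step can also be written with `.SD` and `.SDcols` instead of `mget`. This is only a sketch: `first_non_na` is a hypothetical helper name, and this variant is not part of the benchmarks in this answer, so no speed claim is made for it.

```r
library(data.table)

# Hypothetical helper: first non-NA element of x, or NA if there is none
first_non_na <- function(x) x[!is.na(x)][1]

test <- data.table(A = c(NA, NA, 1), B = c(1, 2, 3),
                   C = c(NA, NA, 1), D = c(1, 2, 2))

# .SDcols restricts .SD to A and C; B is still accessible by name in j
res <- test[, c(list(B = max(B)), lapply(.SD, first_non_na)),
            by = D, .SDcols = c("A", "C")]
print(res)
```

Listing the columns in `.SDcols` keeps the column selection in one place, which can be convenient when many columns need the same treatment.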

Now on to the benchmarks:

library(microbenchmark)
set.seed(123)
> microbenchmark(op(test), akrun_m(test), seb(test))
Unit: microseconds
          expr     min       lq     mean   median       uq      max neval
      op(test) 647.881 669.5210 697.4699 687.1600 725.3510  949.747   100
 akrun_m(test) 691.242 723.5915 759.8265 758.3305 786.0095  906.734   100
     seb(test) 543.220 561.7975 624.5136 583.3550 606.9620 1816.716   100
n <- 1000000
test2 <- data.table(A = sample(c(0:9, NA), n, replace = TRUE),
                    B = sample(c(0:9), n, replace = TRUE),
                    C = sample(c(0:9, NA), n, replace = TRUE),
                    D = sample(c(0:9), n, replace = TRUE))
> microbenchmark(op(test2), akrun_m(test2), seb(test2))
Unit: milliseconds
           expr      min       lq     mean   median       uq      max neval
      op(test2) 24.19831 25.66229 29.12567 26.53464 27.04707 77.30163   100
 akrun_m(test2) 24.61977 25.95137 29.35343 26.88825 27.32421 85.68336   100
     seb(test2) 15.94157 17.19831 18.75357 17.70207 18.04410 69.51589   100

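As you can see, the C++ implementation is faster, but not dramatically so: on the one-million-row table its median time is about 17.7 ms, versus roughly 26.5 to 26.9 ms for the two pure R versions, which is about a 1.5x speedup.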
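
Timings are only comparable if the variants compute the same thing, so it can be worth checking that first. The sketch below compares the two pure R versions on the small table; the names `res_op`, `res_akrun` and `same` are illustrative only. The Rcpp version could presumably be checked the same way once it has been compiled.

```r
library(data.table)

test <- data.table(A = c(NA, NA, 1), B = c(1, 2, 3),
                   C = c(NA, NA, 1), D = c(1, 2, 2))

op <- function(d) d[, list(A = A[!is.na(A)][1], B = max(B), C = C[!is.na(C)][1]), by = D]
akrun_m <- function(d) d[, c(B = max(B), lapply(mget(c("A", "C")), function(x) x[!is.na(x)][1])), by = D]

res_op <- op(test)
res_akrun <- akrun_m(test)

# The two results list their columns in a different order, so align them first
setcolorder(res_akrun, names(res_op))

same <- isTRUE(all.equal(res_op, res_akrun))
print(same)
```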
