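I want to aggregate a data.table with a different operation for each column. How can I select the first non-NA value within each group? I found one way to do it, but I suspect there is a more efficient one: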
library(data.table)
test <- data.table(A = c(NA, NA, 1), B = c(1, 2, 3), C = c(NA, NA, 1), D = c(1, 2, 2))
test[,list(A = A[!is.na(A)][1], B = max(B), C = C[!is.na(C)][1]), by = D]
Is there a more efficient way? I have to do this many times on a very large dataset.
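If you are after speed, you can always go the C++ route, which is faster, though not dramatically so. I modified @akrun's code so that it works with the current version: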
op <- function(d) d[,list(A = A[!is.na(A)][1], B = max(B), C = C[!is.na(C)][1]), by = D]
akrun_m <- function(d) d[, c(B = max(B), lapply(mget(c('A', 'C')), function(x) x[!is.na(x)][1])), by = D]
library(Rcpp)
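cppFunction('double firstnum(NumericVector vec) {
  int n = vec.size();
  // Scan left to right and stop at the first non-missing value
  for (int i = 0; i < n; ++i) {
    if (!R_IsNA(vec[i])) {
      return vec[i];
    }
  }
  // Every element is NA (or the vector is empty): return NA explicitly
  return NA_REAL;
}
')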
seb <- function(d) d[,c(B = max(B), lapply(mget(c('A', 'C')), firstnum)), by = D]
Now for the benchmarks:
library(microbenchmark)
set.seed(123)
> microbenchmark(op(test), akrun_m(test), seb(test))
Unit: microseconds
          expr     min       lq     mean   median       uq      max neval
      op(test) 647.881 669.5210 697.4699 687.1600 725.3510  949.747   100
 akrun_m(test) 691.242 723.5915 759.8265 758.3305 786.0095  906.734   100
     seb(test) 543.220 561.7975 624.5136 583.3550 606.9620 1816.716   100
n <- 1000000
test2 <- data.table(A = sample(c(0:9, NA), n, replace = TRUE),
                    B = sample(c(0:9), n, replace = TRUE),
                    C = sample(c(0:9, NA), n, replace = TRUE),
                    D = sample(c(0:9), n, replace = TRUE))
> microbenchmark(op(test2), akrun_m(test2), seb(test2))
Unit: milliseconds
           expr      min       lq     mean   median       uq      max neval
      op(test2) 24.19831 25.66229 29.12567 26.53464 27.04707 77.30163   100
 akrun_m(test2) 24.61977 25.95137 29.35343 26.88825 27.32421 85.68336   100
     seb(test2) 15.94157 17.19831 18.75357 17.70207 18.04410 69.51589   100
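As you can see, the C++ implementation is faster, but not dramatically so: on the one-million-row table the median drops from about 26.5 ms to about 17.7 ms, roughly a third less time.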
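If compiling C++ is not an option, one pure-R variant that might be worth trying is to look up the position of the first non-missing element with match() instead of building the filtered vector first. This is only a sketch: the helper name first_non_na is hypothetical (it does not come from the posts above), the use of .SDcols is my own choice, and I have not benchmarked it against the three functions above, so treat any speed benefit as an assumption to verify on your own data.

```r
library(data.table)

# Hypothetical helper (not from the original posts):
# match(FALSE, is.na(x)) gives the index of the first non-NA element,
# or NA if there is none, in which case x[NA] yields NA of the same type.
first_non_na <- function(x) x[match(FALSE, is.na(x))]

test <- data.table(A = c(NA, NA, 1), B = c(1, 2, 3),
                   C = c(NA, NA, 1), D = c(1, 2, 2))

# .SDcols restricts .SD to A and C; B is still visible in j by name
res <- test[, c(list(B = max(B)), lapply(.SD, first_non_na)),
            by = D, .SDcols = c("A", "C")]
```

Like the fixed firstnum, the helper returns NA for a group that contains no non-missing value, so the column type stays consistent across groups.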