Improving the efficiency and speed of an R function


When using R, I always keep in mind: "avoid loops whenever possible". But now I'm stuck, and I can't find a vectorized way to write the code I need.

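For the record, after a few comments: my statement above is not correct, and there is no need to avoid the loop here for the sake of efficiency.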

I have two character vectors as input, let's call them a and b. They can only contain the letters "M", "I" and "D".

a = c("M","I","D","D","M","M","M","M","M","M")
b = c("M","M","M","M","M","M","D","M","M")

The output I want is:

d = c("M","I","D","D","M","M","M","M","I","M","M")

The following function gives me that output:

my.function <- function(a, b)
{
  nrow.df = length(a) + length(which(b=="D"))
  my.df = data.frame(a = rep(NA, nrow.df),  
                      b = rep(NA, nrow.df), 
                      d = rep(NA, nrow.df))
  my.df$a[1:length(a)] = a
  my.df$b[1:length(b)] = b
  for (i in 1:nrow.df)
  {
    if(my.df$a[i] == "D") {
      my.df$d[i] = "D"
      my.df$b[(i+1):nrow.df] = my.df$b[i:(nrow.df-1)]
    } else if (my.df$b[i] == "D") {
      my.df$d[i] = "I"
      my.df$a[(i+1):nrow.df] = my.df$a[i:(nrow.df-1)]
    } else if (my.df$a[i] == "I") {
      my.df$d[i] = "I"
    } else if (my.df$b[i] == "I") {
      my.df$d[i] = "D"
    } else {
      my.df$d[i] = my.df$a[i]
    }
  }
  return(my.df$d)
}
> d = my.function(a,b)
> d
 [1] "M" "I" "D" "D" "M" "M" "M" "M" "I" "M" "M"

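The logic of the function is as follows. Whenever a has a "D", it puts a "D" into d and shifts the vector b by 1. Conversely, whenever b has a "D", it puts an "I" into d and shifts a by 1.

Next, when a has an "I" and b does not have a "D", it puts an "I" into d. Conversely, when b has an "I" and a does not have a "D", it puts a "D" into d. Otherwise, d = a.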

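This is not a complicated function, but I'm struggling to make it efficient in R. I apply it millions of times with mclapply, so a fast implementation would save me a lot of time.

Would you recommend Rcpp? Would it be much faster? Is there any slowdown from R talking to C++ millions of times, or is that handled automatically by Rcpp?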

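As per my comment, if speed is a concern, step 1 is to not use data.frames unnecessarily. This answer does not address the loop (as others have already said, there is nothing wrong with a loop in R if it is done well).

Here is a very slightly modified version of your function that uses vectors instead of a data.frame to store the data.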

my.function.v <- function(a, b) {
  nrow.df = length(a) + length(which(b=="D"))
  A <- B <- D <- vector(length = nrow.df)
  A[1:length(a)] = a
  B[1:length(b)] = b
  for (i in 1:nrow.df)
  {
    if(A[i] == "D") {
      D[i] = "D"
      B[(i+1):nrow.df] = B[i:(nrow.df-1)]
    } else if (B[i] == "D") {
      D[i] = "I"
      A[(i+1):nrow.df] = A[i:(nrow.df-1)]
    } else if (A[i] == "I") {
      D[i] = "I"
    } else if (B[i] == "I") {
      D[i] = "D"
    } else {
      D[i] = A[i]
    }
  }
  return(D)
}

Note the relative difference in speed below:

library(microbenchmark)
microbenchmark(my.function(a, b), my.function.v(a, b), f(a, b))
# Unit: microseconds
#                 expr      min        lq    median        uq      max neval
#    my.function(a, b) 1448.416 1490.8780 1511.3435 1547.3880 6674.332   100
#  my.function.v(a, b)  157.248  165.8725  171.6475  179.1865  324.722   100
#              f(a, b)  168.874  177.5455  184.8775  193.3455  416.551   100

As you can see, @mrip's function f is also much faster than the original.

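I don't see any easy way to avoid the loop here. However, there is still a more efficient way to do it. The problem is that you actually shift a or b every time you encounter the character "D", and shifting a vector like that is an O(n) operation, so the running time of the loop is actually O(n^2).

You can simplify the code and get slightly better performance as follows: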

f<-function(a,b){
 aSkipped<-0
 bSkipped<-0
 d<-rep(0,length(a)+sum(b=="D"))
 for(i in 1:length(d)){
    if(a[i-aSkipped] == "D") {
      d[i] = "D"
      bSkipped<-bSkipped+1
    } else if (b[i-bSkipped] == "D") {
      d[i] = "I"
      aSkipped<-aSkipped+1
    } else if (a[i-aSkipped] == "I") {
      d[i] = "I"
    } else if (b[i-bSkipped] == "I") {
      d[i] = "D"
    } else {
      d[i] = a[i-aSkipped]
    }
  }
  d
}
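
One thing to watch in f is that a[i-aSkipped] and b[i-bSkipped] are read without checking that the index is still inside the vector, so an input such as a = "M", b = c("M","D") makes if() fail on an NA. Below is a minimal sketch of a bounds-checked variant. The name f_safe and the choice to treat a position past the end as an empty string are my own assumptions, not part of the code above.

```r
# Hypothetical bounds-checked variant of f (f_safe is an illustrative name).
# It keeps one read position per input instead of the "skipped" counters.
f_safe <- function(a, b) {
  n <- length(a) + sum(b == "D")
  d <- character(n)
  ia <- 1L  # next unread position in a
  ib <- 1L  # next unread position in b
  for (i in seq_len(n)) {
    # Assumption: a position past the end is read as "" instead of NA.
    ca <- if (ia <= length(a)) a[ia] else ""
    cb <- if (ib <= length(b)) b[ib] else ""
    if (ca == "D") {
      d[i] <- "D"
      ia <- ia + 1L            # only a advances, b is held back
    } else if (cb == "D") {
      d[i] <- "I"
      ib <- ib + 1L            # only b advances, a is held back
    } else {
      d[i] <- if (ca == "I") "I" else if (cb == "I") "D" else ca
      ia <- ia + 1L
      ib <- ib + 1L
    }
  }
  d
}
```

On the a and b from the question this returns the same d as f, and f_safe("M", c("M","D")) returns c("M","I") instead of raising an error.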

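Edit: you will see a really big performance improvement once the input gets large. For small strings without too many "D"s, this and Ananda Mahto's solution run in about the same time: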

> set.seed(123)
> a<-c(sample(c("M","I"),500,T))
> b<-c(sample(c("M","I"),500,T))
> a[sample(500,50)]<-"D"
> b[sample(500,50)]<-"D"
> microbenchmark(f(a,b),my.function.v(a,b))
Unit: milliseconds
                expr      min       lq   median       uq      max neval
             f(a, b) 4.259970 4.324046 4.368018 4.463925 9.694951   100
 my.function.v(a, b) 4.442873 4.497172 4.533196 4.639543 9.901044   100

But for strings of length 50000 with 5000 "D"s, the difference is large:

> set.seed(123)
> a<-c(sample(c("M","I"),50000,T))
> b<-c(sample(c("M","I"),50000,T))
> a[sample(50000,5000)]<-"D"
> b[sample(50000,5000)]<-"D"
> system.time(f(a,b))
   user  system elapsed 
  0.460   0.000   0.463 
> system.time(my.function.v(a,b))
   user  system elapsed 
  7.056   0.008   7.077 

OK, here is the Rcpp solution. As expected, it beats the R solutions by a wide margin:

rcppFun<-"
CharacterVector fcpp(CharacterVector a,CharacterVector b,int size){
int aSkipped = 0;
int bSkipped = 0;
int asize = a.size();
Rcpp::CharacterVector d(size);
for(int i=0; i<size; i++){
    if(i-aSkipped<asize && a[i-aSkipped][0] == 'D') {
      d[i] = "D";
      bSkipped++;
    } else if (b[i-bSkipped][0] == 'D') {
      d[i] = "I";
      aSkipped++;
    } else if (a[i-aSkipped][0] == 'I') {
      d[i] = "I";
    } else if (b[i-bSkipped][0] == 'I') {
      d[i] = "D";
    } else {
      d[i] = a[i-aSkipped];
    }
}
 return d;
}"
require("Rcpp")
fcpp<-cppFunction(rcppFun)
f3<-function(a,b){
  fcpp(a,b,as.integer(length(a)+sum(b=="D")))
}

Warning: the function does no argument checking at all, so if you feed it bad data it is easy to get a segfault.
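
If you want some protection, one option is to validate the inputs on the R side before they reach the compiled code. This is only a sketch: check_mid is a hypothetical helper name, and the checks are my assumption of what "bad data" most likely means here (non-character input, NA values, or letters other than "M", "I", "D"). It does not rule out every length combination that could make the C++ loop index past the end of a vector.

```r
# Hypothetical helper: stop early on inputs the compiled loop cannot handle.
check_mid <- function(a, b) {
  stopifnot(is.character(a), is.character(b))
  stopifnot(!anyNA(a), !anyNA(b))
  ok <- c("M", "I", "D")
  if (!all(a %in% ok) || !all(b %in% ok)) {
    stop("a and b may only contain \"M\", \"I\" and \"D\"")
  }
  invisible(TRUE)
}

# Possible use with the wrapper f3 defined above:
# f3_checked <- function(a, b) { check_mid(a, b); f3(a, b) }
```

The checks add a little work to every call, so for millions of calls you may prefer to validate the data once up front and then call f3 directly.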

If you are going to call it a lot, Rcpp is definitely the way to go:

> with(ab(10),microbenchmark(f(a,b),f3(a,b),f2(a,b),my.function.v(a,b)))
Unit: microseconds
                expr     min       lq   median       uq     max neval
             f(a, b) 103.993 107.5155 108.6815 109.7455 178.801   100
            f3(a, b)   7.354   8.1305   8.5575   9.1220  18.014   100
            f2(a, b)  87.081  90.4150  92.2730  94.2585 146.502   100
 my.function.v(a, b)  84.389  86.5140  87.6090  88.8340 109.106   100
> with(ab(100),microbenchmark(f(a,b),f3(a,b),f2(a,b),my.function.v(a,b)))
Unit: microseconds
                expr     min        lq    median        uq      max neval
             f(a, b) 992.082 1018.9850 1032.0180 1071.0690 2784.710   100
            f3(a, b)  12.873   14.3605   14.7370   15.5095   35.582   100
            f2(a, b) 119.396  125.4405  129.3015  134.9915 1909.930   100
 my.function.v(a, b) 769.618  786.7865  802.2920  824.0820  905.737   100
> with(ab(1000),microbenchmark(f(a,b),f3(a,b),f2(a,b),my.function.v(a,b)))
Unit: microseconds
                expr      min        lq     median        uq       max neval
             f(a, b) 9816.295 10065.065 10233.1350 10392.696 12383.373   100
            f3(a, b)   66.057    67.869    83.9075    87.231  1167.086   100
            f2(a, b) 1637.972  1760.258  2667.6985  3138.229 47610.317   100
 my.function.v(a, b) 9692.885 10272.425 10997.2595 11402.602 54315.922   100
> with(ab(10000),microbenchmark(f(a,b),f3(a,b),f2(a,b)))
Unit: microseconds
     expr        min         lq      median          uq        max neval
  f(a, b) 101644.922 103311.678 105185.5955 108342.4960 144620.777   100
 f3(a, b)    607.702    610.039    669.8515    678.1845    785.415   100
 f2(a, b) 221305.641 247952.345 254478.1580 341195.5510 656408.378   100
> 

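Just to show how it can be done: this can be written without an R loop, and here is one way. It is faster when the length is around 1000 or less, but slower when the input is larger. One takeaway is that you can definitely make this faster in Rcpp.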

f2 <- function(a,b) {
  da <- which(a=="D")
  db <- which(b=="D")
  dif <- outer(da, db, `<`) 
  da <- da + rowSums(!dif)
  db <- db + colSums(dif)
  ia <- which(a=="I")  
  ia <- ia + colSums(outer(db, ia, `<`))
  ib <- which(b=="I")
  ib <- ib + colSums(outer(da, ib, `<`))
  out <- rep("M", length(a) + length(db))
  out[da] <- "D"
  out[db] <- "I"
  out[ia] <- "I"
  out[ib] <- "D"
  out
}
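
To make the outer() step in f2 more concrete, here is its first part worked through on the a and b from the question. The names da_out and db_out are only for illustration. As I read it, each "D" position is pushed to the right by the number of "D"s in the other vector that precede it.

```r
a <- c("M","I","D","D","M","M","M","M","M","M")
b <- c("M","M","M","M","M","M","D","M","M")

da <- which(a == "D")          # positions of "D" in a: 3 4
db <- which(b == "D")          # positions of "D" in b: 7
dif <- outer(da, db, `<`)      # 2 x 1 logical matrix: is da[i] < db[j]?

da_out <- da + rowSums(!dif)   # output positions of the "D"s coming from a
db_out <- db + colSums(dif)    # output positions of the "I"s coming from b
```

Here da_out is 3 4 and db_out is 9, which matches the "D" "D" at positions 3 and 4 and the "I" at position 9 in the expected d. The outer() matrix has length(da) * length(db) cells, which would explain why this approach falls behind once the inputs get long.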

To generate the data:

ab <- function(N) {
  set.seed(123)
  a<-c(sample(c("M","I"),N,TRUE))
  b<-c(sample(c("M","I"),N,TRUE))
  a[sample(N,N/10)]<-"D"
  b[sample(N,N/10)]<-"D"
  list(a=a,b=b)
}

Timings:

> library(microbenchmark)
> with(ab(10), microbenchmark(my.function.v(a, b), f(a, b), f2(a,b)))
Unit: microseconds
                expr    min       lq   median       uq     max neval
 my.function.v(a, b) 79.102  86.9005  89.3680  93.2410 279.761   100
             f(a, b) 84.334  91.1055  94.1790  98.2645 215.579   100
            f2(a, b) 94.807 101.5405 105.1625 108.9745 226.149   100
> with(ab(100), microbenchmark(my.function.v(a, b), f(a, b), f2(a,b)))
Unit: microseconds
                expr     min       lq  median       uq      max neval
 my.function.v(a, b) 732.849 750.4480 762.906 845.0835 1953.371   100
             f(a, b) 789.380 805.8905 819.022 902.5865 1921.064   100
            f2(a, b) 124.442 129.1450 134.543 137.5910  237.498   100
> with(ab(1000), microbenchmark(my.function.v(a, b), f(a, b), f2(a,b)))
Unit: milliseconds
                expr       min        lq    median        uq      max neval
 my.function.v(a, b) 10.146865 10.387144 10.695895 11.123164 13.08263   100
             f(a, b)  7.776286  7.973918  8.266882  8.633563  9.98204   100
            f2(a, b)  1.322295  1.355601  1.385302  1.465469  1.85349   100
> with(ab(10000), microbenchmark(my.function.v(a, b), f(a, b), f2(a,b), times=10))
Unit: milliseconds
                expr      min        lq    median        uq       max neval
 my.function.v(a, b) 429.4030 435.00373 439.06706 442.51650 465.00124    10
             f(a, b)  80.7709  83.71715  85.14887  88.02067  89.00047    10
            f2(a, b) 164.7807 170.37608 175.94281 247.78353 251.14653    10
