R - A faster "Resample with replacement by cluster"



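I have the same problem as in "Resample with replacement by cluster", i.e. I want to do a cluster bootstrap. The best answer to that question, rbindlist(lapply(resampled_ids, function(resampled_id) df[df$id == resampled_id,])), works, but because I have a large dataset, this resampling step is rather slow. My question is: is it possible to speed this up?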

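Use sequence to build the row index directly. As written, this assumes that df is sorted by id and that the ids are the consecutive integers 1, 2, ..., as in the example data. Demonstrating with a larger data.frame: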

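# Example data: ids are consecutive integers and rows are sorted by id
df <- data.frame(id = rep.int(1:1e2, sample(100:200, 1e2, replace = TRUE))[1:1e4], X = rnorm(1e4))
# Draw cluster ids with replacement
resampled_ids <- sample(unique(df$id), replace = TRUE)
# For each drawn id: number of rows (tabulate) and first row (match), expanded into row indices
idx <- sequence(tabulate(df$id)[resampled_ids], match(unique(df$id), df$id)[resampled_ids])
# Build the resampled data.frame column by column from the index vector
s <- data.frame(id = df$id[idx], X = df$X[idx])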
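
To see what the sequence call produces, here is a minimal sketch on a toy data set. The names df_small, ids_drawn, counts and starts are made up for illustration, and the draw is fixed by hand instead of using sample() so the result is easy to follow. It assumes a version of R in which sequence() accepts a from argument (R 4.0.0 or later, as far as I know).

```r
# Toy data: ids are consecutive integers starting at 1, rows sorted by id
df_small <- data.frame(id = c(1, 1, 2, 2, 2, 3, 3), X = 1:7)

# Fixed "draw" of cluster ids (illustrative; normally this comes from sample())
ids_drawn <- c(3, 1, 3)

counts <- tabulate(df_small$id)                    # rows per id: 2 3 2
starts <- match(unique(df_small$id), df_small$id)  # first row of each id: 1 3 6

# sequence(nvec, from) concatenates the runs from[i], from[i] + 1, ...
# each of length nvec[i]
idx_small <- sequence(counts[ids_drawn], starts[ids_drawn])
idx_small
#> [1] 6 7 1 2 6 7
```

Rows 6 and 7 (id 3) appear twice because id 3 was drawn twice, and rows 1 and 2 (id 1) appear once.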

Benchmarking against the rbindlist solution:

library(data.table)
library(microbenchmark)
microbenchmark(rbindlist = rbindlist(lapply(resampled_ids, function(x) df[df$id %in% x,])),
sequence = {idx <- sequence(tabulate(df$id)[resampled_ids], match(unique(df$id), df$id)[resampled_ids])
data.frame(id = df$id[idx], X = df$X[idx])})
#> Unit: microseconds
#>       expr    min      lq      mean   median       uq     max neval
#>  rbindlist 9480.4 9921.95 11470.567 10431.05 12555.35 31178.2   100
#>   sequence  406.7  444.55   564.873   498.10   545.70  2818.4   100
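
Before trusting the timings, it may be worth checking that both approaches return the same rows in the same order. The sketch below is one way to do that on a smaller data set; it uses base R do.call(rbind, ...) as a stand-in for rbindlist so that it has no package dependencies, and the names ref, fast and same are illustrative.

```r
set.seed(1)

# Smaller example data: sorted, consecutive integer ids
id <- rep.int(1:20, sample(5:10, 20, replace = TRUE))
df <- data.frame(id = id, X = rnorm(length(id)))
resampled_ids <- sample(unique(df$id), replace = TRUE)

# Reference result: subset cluster by cluster, then bind the pieces
ref <- do.call(rbind, lapply(resampled_ids, function(x) df[df$id == x, ]))

# sequence-based result
idx <- sequence(tabulate(df$id)[resampled_ids],
                match(unique(df$id), df$id)[resampled_ids])
fast <- data.frame(id = df$id[idx], X = df$X[idx])

# Row names differ between the two, so compare column contents only
same <- identical(ref$id, fast$id) && identical(ref$X, fast$X)
stopifnot(same)
```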

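Note that building a new data.frame from the index vector is much faster than row-indexing the original data.frame. With data.table the difference is less pronounced, but, surprisingly, the rbindlist solution becomes slower: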

microbenchmark(rbindlist = rbindlist(lapply(resampled_ids, function(x) df[df$id %in% x,])),
sequence1 = df[sequence(tabulate(df$id)[resampled_ids], match(unique(df$id), df$id)[resampled_ids]),],
sequence2 = {idx <- sequence(tabulate(df$id)[resampled_ids], match(unique(df$id), df$id)[resampled_ids])
data.frame(id = df$id[idx], X = df$X[idx])})
#> Unit: microseconds
#>       expr    min     lq      mean   median       uq     max neval
#>  rbindlist 9431.9 9957.7 11101.545 10508.15 12395.25 15363.3   100
#>  sequence1 4284.5 4550.3  4866.891  4674.80  5009.90  8350.1   100
#>  sequence2  414.1  455.6   541.590   508.40   551.40  2881.1   100
setDT(df) # convert df to a data.table by reference, then rerun the benchmark

microbenchmark(rbindlist = rbindlist(lapply(resampled_ids, function(x) df[df$id %in% x,])),
sequence1 = df[sequence(tabulate(df$id)[resampled_ids], match(unique(df$id), df$id)[resampled_ids]),],
sequence2 = {idx <- sequence(tabulate(df$id)[resampled_ids], match(unique(df$id), df$id)[resampled_ids])
data.table(id = df$id[idx], X = df$X[idx])})
#> Unit: microseconds
#>       expr     min       lq      mean   median      uq     max neval
#>  rbindlist 14877.4 15878.30 17181.572 16348.50 18527.6 22520.9   100
#>  sequence1   795.0  1016.80  1187.266  1101.95  1326.7  2566.5   100
#>  sequence2   386.4   441.75   556.226   473.70   500.9  3373.6   100

Update

In response to jay.sf's comment:

lens <- tabulate(df$id)[resampled_ids]
idx <- sequence(lens, match(unique(df$id), df$id)[resampled_ids])
s <- data.frame(cluster = rep.int(seq_along(resampled_ids), lens), id = df$id[idx], X = df$X[idx])

The cluster column corresponds to the position of each draw in resampled_ids.
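
A small sketch of why the extra cluster column can matter: when the same id is drawn more than once, grouping by id merges the draws, while grouping by cluster keeps them apart. The toy data and the fixed draw are made up for illustration.

```r
# Toy data with sorted, consecutive integer ids
df_small <- data.frame(id = c(1L, 1L, 2L, 2L, 2L, 3L, 3L), X = c(1, 2, 3, 4, 5, 6, 7))

# Fixed draw in which id 3 appears twice (illustrative; normally from sample())
resampled_ids <- c(3L, 1L, 3L)

lens <- tabulate(df_small$id)[resampled_ids]
idx <- sequence(lens, match(unique(df_small$id), df_small$id)[resampled_ids])
s <- data.frame(cluster = rep.int(seq_along(resampled_ids), lens),
                id = df_small$id[idx], X = df_small$X[idx])

n_by_id <- table(s$id)            # two groups: id 1 (2 rows), id 3 (4 rows)
n_by_cluster <- table(s$cluster)  # three groups of 2 rows each
```

For estimators that treat each resampled cluster as its own unit (for example, models with cluster-level effects), grouping on cluster rather than id is probably what is wanted.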

f = data.frame( id=c(1,1,2,2,2,3,3), X = rnorm(7) )

Try this:

ind_id <- split(seq_along(f$id), f$id)
samp_id <- sample(names(ind_id), replace = TRUE)
f[unlist(ind_id[samp_id]), ]
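
Because split looks clusters up by name, this approach should also work when ids are not consecutive integers or the rows are not sorted. In a bootstrap the index list only needs to be built once and can be reused for every replicate. The following is a minimal sketch of that idea; boot_mean_once and B are hypothetical names, and the statistic (the mean of X) is just a placeholder.

```r
set.seed(42)
f <- data.frame(id = c(1, 1, 2, 2, 2, 3, 3), X = rnorm(7))

# Row indices per cluster, computed once and reused across replicates
ind_id <- split(seq_along(f$id), f$id)

# Hypothetical helper: one cluster-bootstrap replicate of the mean of X
boot_mean_once <- function() {
  samp_id <- sample(names(ind_id), replace = TRUE)
  rows <- unlist(ind_id[samp_id], use.names = FALSE)
  mean(f$X[rows])
}

B <- 200  # number of bootstrap replicates (arbitrary choice)
boot_means <- replicate(B, boot_mean_once())

# Bootstrap standard error of the mean under cluster resampling
se_boot <- sd(boot_means)
```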
