R中彩票号码的有效抽样

我想编写一个函数，对n彩票进行采样，这些彩票的6个数字从1到45，而不进行替换。然而，我需要高效地完成这项工作，这意味着没有循环或类似循环的函数。(我想Rcpp也可以，但我更喜欢基本R中的矢量化解决方案(

无限制解决：

lottery_inef <- function(n){

t(replicate(n,
sample(1:45, 6)))
}

这里我得到一个矩阵，每一行对应一张彩票。现在，如果我想模拟数百万张彩票，速度会很慢，因此我对矢量化解决方案感兴趣。

我的想法是：

lottery_ef <- function(n){

m <- matrix(sample(1:45, n*6, replace = TRUE), ncol = 6)

# somehow subset the matrix without a loop to remove all the 
# rows that have non-unique values as in the lottery we can only draw each number once
}

对于高效的版本，我有点迷失在没有循环或apply((的子集设置上。如果有人能解决这个小问题，或者给我一个完全不同的方向，让我找到一个解决方案，我将不胜感激。

replicate在这种规模下实际上并没有做得那么好。使用实时编译(现在在R中使用了几年(，for循环可以更快，尤其是当我们可以精确地预分配数据结构时。我们还可以避免t():

lottery_inef <- function(n){
t(replicate(n,
sample(1:45, 6)))
}
lottery_preall <- function(n){
m = matrix(NA_integer_, nrow = n, ncol = 6)
for(i in 1:n) {
m[i, ] = sample.int(45L, size = 6)
}
m
}
nn = 1e6
microbenchmark::microbenchmark(
lottery_inef(nn), 
lottery_preall(nn),
times = 2
)
# Unit: seconds
#                expr      min       lq     mean   median       uq      max neval
#    lottery_inef(nn) 9.400862 9.400862 9.571756 9.571756 9.742649 9.742649     2
#  lottery_preall(nn) 4.948216 4.948216 5.454482 5.454482 5.960749 5.960749     2

replicate将结果累积到list中，然后需要检查每个结果的维度，然后才能决定将其简化为矩阵，并且必须进行转换。所有这些开销都通过预先分配的整数矩阵跳过，以实现大约2倍的加速。

我们也可以与vapply进行比较(快速测试显示vapply只比循环慢一点(，但我认为要想获得更高的速度，你需要并行运行——这在这里是一个很好的选择，可能会让你获得几乎与你使用的内核数相等的加速。

sample.int基本上只是对C代码的调用，所以使用Rcpp可能不会做得更好——我认为并行化是提高速度的最佳选择。

由于为这样大小的集合生成所有组合只需要几秒钟，因此可能值得这样做，然后将其子集用于"彩票"。下面，我使用sample()生成了100万个行索引(包括替换和不包括替换(，并在整个集合上使用括号括起来的子集来生成可能的票证。

如果你需要经常这样做，或者在不同的时间这样做，那么保存完整的组合集可能是值得的，而不是每次都重新生成它。几乎所有的处理都是生成完整的组合。在那之后选择"票"是很快的。

计时显示，创建所有组合需要约6秒，创建100万个索引需要约.2秒，创建包含100万行的方括号子集需要约.1秒。

set.seed(2)
tictoc::tic() #included for timing
# All possible lotto combinations as matrix, 1 per row
lotto_all <- t(combn(1:45, 6))
tictoc::toc() #included for timing
#> 5.899 sec elapsed
# A look at the data:
head(lotto_all)
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    1    2    3    4    5    6
#> [2,]    1    2    3    4    5    7
#> [3,]    1    2    3    4    5    8
#> [4,]    1    2    3    4    5    9
#> [5,]    1    2    3    4    5   10
#> [6,]    1    2    3    4    5   11
# Getting index (row) numbers for our 'tickts' with & without replacement
tictoc::tic()
sample_indices_no_replacement <- sample(1:nrow(lotto_all), size = 1e6, replace = F)
tictoc::toc()
#> 0.178 sec elapsed
sample_indices_w_replacement <- sample(1:nrow(lotto_all), size = 1e6, replace = T)
# The number combinations of our 'tickets'
tictoc::tic()
sample_tickets_no_rep <- lotto_all[sample_indices_no_replacement,]
tictoc::toc()
#> 0.097 sec elapsed
sample_tickets_rep <- lotto_all[sample_indices_w_replacement,]
# A look at the sample tickets:
head(sample_tickets_no_rep)
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    8   12   14   31   34   44
#> [2,]    6   10   16   26   32   36
#> [3,]    3    4   10   15   41   43
#> [4,]    2    3    5   17   33   36
#> [5,]    7   17   24   25   35   40
#> [6,]   32   33   34   36   39   43
# See that there are some duplicates using replacement = T
length(unique(sample_indices_no_replacement))
#> [1] 1000000
length(unique(sample_indices_w_replacement))
#> [1] 941309

^{由reprex包(v0.3.0(创建于2020-10-27}

由于效率是主要关注点，因此有几个包arrangements和RcppAlgos^*考虑到了这一特定任务。

在开始之前，我们首先说明，当我们使用sample时，我们无法控制结果的唯一性。每次绘制都来自均匀分布，因此我们可以多次重复绘制相同的排列。使用@Gregor的功能，我们有：

set.seed(42)
system.time(a <- lottery_inef(1e6))
user  system elapsed 
7.640   0.345   7.984
sum(duplicated(a))
[1] 86
set.seed(42)
system.time(b <- lottery_preall(1e6))
user  system elapsed 
3.673   0.256   3.929
sum(duplicated(b))
[1] 86

虽然使用包arrangements更快，但我们仍然看到相同的行为：

set.seed(42)
system.time(arng <- arrangements::permutations(45, 6, nsample = 1e6))
user  system elapsed 
0.761   0.021   0.785 
sum(duplicated(arng))
[1] 108

现在，使用包RcppAlgos，如果请求的结果数量小于结果总数(在我们的情况下超过50亿(，我们可以保证结果是唯一的：

RcppAlgos::permuteCount(45, 6)
[1] 5864443200
system.time(algosSer <- RcppAlgos::permuteSample(45, 6,
n = 1e6,
seed = 42))
user  system elapsed 
0.560   0.001   0.561
sum(duplicated(algosSer))
[1] 0

此外，我们可以通过nThreads参数使用多个线程，以获得更高的速度。

system.time(algosPar <- RcppAlgos::permuteSample(45, 6,
n = 1e6,
seed = 42,
nThreads = 4))
user  system elapsed 
0.574   0.001   0.280
## Results are the same as the serial version
identical(algosPar, algosSer)
[1] TRUE

^*我是RcppAlgos的作者

相关内容

最新更新

热门标签：