Random binary data frame with fixed column sums



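I am trying to build a data frame consisting entirely of 1s and 0s. It should be generated randomly, except that each column needs to sum to a specified value.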

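If this were just a single data frame I would know how to do it, but it needs to be built into a function, where it will be repeated as an iterative process up to 1000 times.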

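One efficient approach is to build, for each column, a vector with the appropriate number of 1s and 0s and then shuffle it. You can define the following function to generate a matrix with a specified number of rows and a specified number of 1s in each column: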

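# Build a 0/1 matrix with nrow rows whose i-th column contains exactly csums[i] ones.
build.mat <- function(nrow, csums) {
  # For each column: nrow - x zeros followed by x ones, then shuffled.
  sapply(csums, function(x) sample(rep(c(0, 1), c(nrow - x, x))))
}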
set.seed(144)
build.mat(5, 0:5)
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    0    0    0    0    1    1
# [2,]    0    0    0    1    0    1
# [3,]    0    0    0    0    1    1
# [4,]    0    1    1    1    1    1
# [5,]    0    0    1    1    1    1
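
As a quick sanity check on the function above, the following sketch confirms that the result contains only 0s and 1s and that each column adds up to the requested total. `build.mat` is repeated from the answer so the block runs on its own; the names `csums` and `m` are introduced here only for illustration.

```r
# build.mat is copied from the answer above so this block is self-contained.
build.mat <- function(nrow, csums) {
  sapply(csums, function(x) sample(rep(c(0, 1), c(nrow - x, x))))
}

set.seed(144)
csums <- 0:5              # requested number of 1s per column
m <- build.mat(5, csums)

# stopifnot() is silent when every condition holds and errors otherwise.
stopifnot(
  all(m %in% c(0, 1)),       # every entry is 0 or 1
  all(colSums(m) == csums)   # each column has the requested total
)
```

Because `stopifnot()` prints nothing on success, a silent run means both properties hold.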

To build a list of such matrices, you can use lapply over a list holding the column sums for each matrix:

cslist <- list(1:3, c(4, 2))
set.seed(144)
lapply(cslist, build.mat, nrow=5)
# [[1]]
#      [,1] [,2] [,3]
# [1,]    0    1    1
# [2,]    0    0    0
# [3,]    0    0    0
# [4,]    0    1    1
# [5,]    1    0    1
# 
# [[2]]
#      [,1] [,2]
# [1,]    0    0
# [2,]    1    0
# [3,]    1    1
# [4,]    1    0
# [5,]    1    1
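
The question asks for data frames produced repeatedly (up to 1000 times), whereas the function above returns a matrix. One possible way to bridge that gap, shown here only as a sketch, is to wrap the result in `as.data.frame` and repeat the call with `replicate`. The wrapper name `build.df` and the column totals `c(4, 2, 3)` are assumptions made for illustration and are not part of the original answer.

```r
# build.mat is copied from the answer above so this block is self-contained.
build.mat <- function(nrow, csums) {
  sapply(csums, function(x) sample(rep(c(0, 1), c(nrow - x, x))))
}

# Hypothetical wrapper: same construction, returned as a data frame.
build.df <- function(nrow, csums) {
  as.data.frame(build.mat(nrow, csums))
}

set.seed(144)
csums <- c(4, 2, 3)   # illustrative column totals

# simplify = FALSE keeps the result as a plain list of 1000 data frames.
dfs <- replicate(1000, build.df(5, csums), simplify = FALSE)

length(dfs)                 # number of data frames generated
sapply(dfs[1:3], colSums)   # column sums of the first three data frames
```

Each element of `dfs` is an independent random draw, so the same column totals are honoured in every iteration while the positions of the 1s vary.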

If there are many more 0s than 1s, or vice versa, @akrun's approach may be faster:

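build_01_mat <- function(n,n1s){
  nc        <- length(n1s)
  # fill the matrix with the majority value, then overwrite only the minority cells
  zerofirst <- sum(n1s) < n*nc/2
  tochange  <- if (zerofirst) n1s else n-n1s   # number of cells to overwrite per column
  mat       <- matrix(if (zerofirst) 0L else 1L,n,nc)
  # index with a two-column (row, column) matrix: random rows for each column
  mat[cbind(
    unlist(c(sapply((1:nc)[tochange>0],function(col)sample(1:n,tochange[col])))),
    rep(1:nc,tochange)
  )] <- if (zerofirst) 1L else 0L
  mat
}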
set.seed(1)
build_01_mat(5,c(1,3,0))
#      [,1] [,2] [,3]
# [1,]    0    0    0
# [2,]    1    1    0
# [3,]    0    1    0
# [4,]    0    1    0
# [5,]    0    0    0
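
The replacement approach above relies on a base R feature: indexing a matrix with a two-column matrix, where each row is a (row, column) pair. A minimal illustration of just that mechanism, with made-up positions stored in a variable called `idx` (a name used here only for the example), might look like this:

```r
# Start from an all-zero integer matrix.
mat <- matrix(0L, nrow = 3, ncol = 2)

# Each row of idx is one (row, column) position to overwrite.
idx <- cbind(c(1, 3, 2),   # row numbers
             c(1, 1, 2))   # column numbers

# A single assignment sets all listed cells at once.
mat[idx] <- 1L
mat
#      [,1] [,2]
# [1,]    1    0
# [2,]    0    1
# [3,]    1    0
```

Only the listed cells are touched, which is why this approach does less work than shuffling whole columns when one value is rare.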

Some benchmarks:

library(rbenchmark)
# similar numbers of zeros and ones
benchmark(
  permute=build.mat(1e7,1e7/2),
  replace=build_01_mat(1e7,1e7/2),replications=10)[1:5]
#      test replications elapsed relative user.self
# 1 permute           10    7.68    1.126      6.59
# 2 replace           10    6.82    1.000      6.27
# many more zeros than ones
benchmark(
  permute=build.mat(1e6,rep(10,20)),
  replace=build_01_mat(1e6,rep(10,20)),replications=10)[1:5]
#      test replications elapsed relative user.self
# 1 permute           10   10.28    3.779      8.51
# 2 replace           10    2.72    1.000      2.23
# many more ones than zeros
benchmark(
  permute=build.mat(1e6,1e6-rep(10,20)),
  replace=build_01_mat(1e6,1e6-rep(10,20)),replications=10)[1:5]
#      test replications elapsed relative user.self
# 1 permute           10   10.94    4.341      9.28
# 2 replace           10    2.52    1.000      2.09
