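I have a matrix of 0s and 1s. About 30% of the samples are 1, and I want to estimate a confidence interval around this percentage (e.g., if I sampled the entire population, perhaps 28-32% of the samples would be 1). To do this, one can bootstrap from the sample (redraw the sample N times with replacement and analyze the distribution of the percentage of 1s over all redrawn samples). However, my data are nested (highly correlated) within rows and columns. I tested whether this nestedness makes a difference (because my variables are binary, I can use rflip() to simulate biased coin flips), and it does: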
library("mosaic")
#### data example ####
c1<-c(1,1,1,1,1,0,0,0,0,0) # high probability for "1"
c2<-c(1,0,0,0,0,0,0,0,0,0) # low probability for "1"
d<-cbind.data.frame(c1,c2)
#### a) resample over entire data ####
b<-vector()
for (i in 1:10000){
b[i] <- rflip(20, # Flip 20 times,
6/20)/ # Probability for "1": 6/20, i.e., probability for "0": 14/20
20 # divide by 20 to return relative frequency
}
mean(b)# returns 0.3007955 # mean over 10000 replications: close to 6/20
sd(b) # returns 0.1024339 # standard deviation important to compute confidence interval
#### b) resample per column ####
b1 <- vector()
b2 <- vector()
bt <- vector()
for (i in 1:10000){
b1[i] <- rflip(10,(5/10)) # Flip 10 times with the probability for c1
b2[i] <- rflip(10,(1/10)) # Flip 10 times with the probability for c2
bt[i] <- (b1[i]+b2[i])/20 # sum up all 20 flips and divide by 20 to return relative frequency
}
mean(bt)# returns 0.3001475 # mean similar to a)
sd(bt) # returns 0.09214384 # standard deviation smaller than a)
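When I redraw 10 times from column c1 and 10 times from column c2, and repeat this process 10,000 times, the observed distribution of the proportion is narrower than when I sample 20 times from the entire data. If the probability for 1 were the same in both columns, methods a) and b) would give the same standard deviation.
I now want to account not only for columns but also for rows. For example, I want to draw 10 times from column 1 and 10 times from column 2, with the constraint that these 20 draws include exactly two draws per row. My first idea was: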
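The comparison of a) and b) can also be written with base R, resampling the observed values with sample() instead of simulating coin flips with rflip(). This is only a sketch: the seed, the names pooled and by_col, and the choice of a 95% percentile interval are assumptions made for illustration, not part of the original code.

```r
# Sketch: bootstrap the observed 0/1 values directly with base R.
set.seed(1)                              # arbitrary seed (assumption), for reproducibility only
c1 <- c(1,1,1,1,1,0,0,0,0,0)             # high probability for "1"
c2 <- c(1,0,0,0,0,0,0,0,0,0)             # low probability for "1"
n_reps <- 10000

# a) pooled: draw 20 values with replacement from all 20 observations
pooled <- replicate(n_reps, mean(sample(c(c1, c2), 20, replace = TRUE)))

# b) per column: draw 10 values with replacement from each column separately
by_col <- replicate(n_reps,
  (sum(sample(c1, 10, replace = TRUE)) + sum(sample(c2, 10, replace = TRUE))) / 20)

# Spread of the resampled proportions under each scheme
sd(pooled)
sd(by_col)

# 95% percentile intervals for the proportion of 1s
ci_pooled <- quantile(pooled, c(0.025, 0.975))
ci_by_col <- quantile(by_col, c(0.025, 0.975))
```

Sampling with replacement from a binary vector is the same as flipping a coin whose probability equals the observed proportion, so this should behave like the rflip() version while also working for non-binary data.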
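for loop {
- randomize the column order
- draw 10 times from column 1, with the constraint that each row is redrawn at most 2 times
- draw 10 times from column 2, with the constraint that the redraws from column 1 plus those from column 2 total at most 2 per row (if row 1 was already redrawn twice in column 1, then row 1 is not redrawn in column 2)
}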
Does anyone know how to do this, or have a better idea? Possibly with a function other than rflip(). It would help me a lot!
Thanks, ajj
Take a look at r2dtable:
nrows <- 10L
ncols <- 6L
nr <- rep(ncols, nrows)
nc <- rep(nrows, ncols)
m <- r2dtable(1, nr, nc)[[1]]
m
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 2 0 1 1 1 1
#> [2,] 2 1 1 0 1 1
#> [3,] 0 1 1 2 2 0
#> [4,] 0 3 1 1 0 1
#> [5,] 1 2 1 0 1 1
#> [6,] 0 0 1 2 0 3
#> [7,] 3 0 1 1 0 1
#> [8,] 2 0 0 1 3 0
#> [9,] 0 0 1 2 1 2
#> [10,] 0 3 2 0 1 0
rowSums(m)
#> [1] 6 6 6 6 6 6 6 6 6 6
colSums(m)
#> [1] 10 10 10 10 10 10
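"I want to draw 10 times from column 1 and 10 times from column 2, with the constraint that these 20 draws include exactly two draws per row."
That would be: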
nrows <- 10L
ncols <- 2L
nr <- rep(ncols, nrows)
nc <- rep(nrows, ncols)
m <- r2dtable(1, nr, nc)[[1]]
m
#> [,1] [,2]
#> [1,] 2 0
#> [2,] 0 2
#> [3,] 1 1
#> [4,] 2 0
#> [5,] 0 2
#> [6,] 2 0
#> [7,] 1 1
#> [8,] 1 1
#> [9,] 0 2
#> [10,] 1 1
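Each entry of such a table can be read as the number of times the corresponding cell of the data is drawn, so a single constrained resample of the proportion is the count-weighted mean of the data. A minimal sketch, assuming the two-column example data from the question (the seed and the name prop are assumptions for illustration):

```r
# Sketch: turn one r2dtable() draw into one resampled proportion.
set.seed(1)                                  # arbitrary seed (assumption)
d <- cbind(c1 = c(1,1,1,1,1,0,0,0,0,0),
           c2 = c(1,0,0,0,0,0,0,0,0,0))

# Row totals: 2 draws per row; column totals: 10 draws per column
m <- r2dtable(1, rep(ncol(d), nrow(d)), rep(nrow(d), ncol(d)))[[1]]

rowSums(m)   # every row is drawn exactly 2 times
colSums(m)   # every column is drawn exactly 10 times

# m[i, j] is how often cell (i, j) is drawn, so weight d by m
prop <- sum(m * d) / sum(m)
prop
```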
You are right that the SD is smaller when the row/column counts are constrained during resampling:
broadcast <- Rcpp::cppFunction(
"arma::cube broadcast(arma::cube& m, arma::mat& d) {return(m.each_slice() % d);}",
depends = "RcppArmadillo",
plugins = "cpp11"
)
c1 <- c(1,1,1,1,1,0,0,0,0,0) # high probability for "1"
c2 <- c(1,0,0,0,0,0,0,0,0,0) # low probability for "1"
d <- cbind(c1, c2)
nr <- nrow(d)
nc <- ncol(d)
nreps <- 1e4L
bt <- colSums(
broadcast(
simplify2array(
r2dtable(
nreps,
rep(nc, nr),
rep(nr, nc)
)
),
d
),
dims = 2
)/nr/nc
p <- mean(d)
mean(d) # true mean
#> [1] 0.3
mean(bt) # estimated mean
#> [1] 0.300685
sqrt(p*(1 - p)/nr/nc) # expected SD from uniform samples of size nr*nc
#> [1] 0.1024695
sqrt((p*(1 - p) - var(colMeans(d))*(1 - 1/nc))/nr/nc) # expected SD from column-wise resampling
#> [1] 0.09219544
sd(bt) # estimated SD from constrained row and column resampling
#> [1] 0.05604547
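If compiling with Rcpp is not an option, the same constrained resampling can be sketched in base R by looping over the tables returned by r2dtable(). This is an alternative written under assumptions (the seed, the use of vapply() in place of the Armadillo broadcast, and the 95% percentile interval are not part of the code above); it is expected to be slower but should give comparable results.

```r
# Sketch: constrained row/column resampling without Rcpp.
set.seed(1)                                  # arbitrary seed (assumption)
d <- cbind(c1 = c(1,1,1,1,1,0,0,0,0,0),
           c2 = c(1,0,0,0,0,0,0,0,0,0))
nr <- nrow(d)
nc <- ncol(d)
nreps <- 1e4L

# One count table per replicate: nc draws per row, nr draws per column
tabs <- r2dtable(nreps, rep(nc, nr), rep(nr, nc))

# Count-weighted proportion of 1s for each replicate
bt <- vapply(tabs, function(m) sum(m * d) / (nr * nc), numeric(1))

mean(bt)                                     # should be close to mean(d)
sd(bt)                                       # SD under row and column constraints
quantile(bt, c(0.025, 0.975))                # 95% percentile interval
```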
sample(rep(1:10,2),size=10,replace=FALSE)