Shuffle a vector in R, but identical elements should have a minimum distance



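I want to randomize/shuffle a vector. Some of the vector elements are identical. After shuffling, identical elements should have a minimum distance of three (i.e. two other elements should lie between identical elements).

Consider the following example vector in R: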

x <- rep(LETTERS[1:5], 3)  # Create example vector
x
#  [1] "A" "B" "C" "D" "E" "A" "B" "C" "D" "E" "A" "B" "C" "D" "E"

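If I shuffle the vector with the sample function, some identical elements may end up too close together. For example, with the R code below, "C" appears at the adjacent positions 5 and 6: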

set.seed(53135)
sample(x)                  # sample() function puts same elements too close
#  [1] "B" "A" "E" "D" "C" "C" "E" "A" "B" "C" "D" "E" "A" "D" "B"
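
To make "too close" measurable, a small helper can report the smallest index distance between any two identical elements. This is only a sketch: min_gap is a hypothetical name that does not appear in the original question, and the vector is the sample() output shown above, typed in by hand so the result does not depend on the random number generator.

```r
# Hypothetical helper (not from the original post): smallest index distance
# between two identical elements; Inf if no value occurs twice.
min_gap <- function(v) {
  pos  <- split(seq_along(v), v)                       # positions per distinct value
  gaps <- unlist(lapply(pos, diff), use.names = FALSE) # distances between repeats
  if (length(gaps) == 0) Inf else min(gaps)
}

shuffled <- c("B", "A", "E", "D", "C", "C", "E", "A", "B", "C", "D", "E", "A", "D", "B")
min_gap(shuffled)
# [1] 1   ("C" sits at positions 5 and 6)
```

A result of 3 or more would satisfy the requirement stated in the question.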

How can I ensure a minimum distance of three between identical elements?

So basically we need to conditionally sample an element of the x vector that has not been picked in the last min.dist draws. Using purrr's reduce we can achieve this:

library(purrr)  # provides reduce()
min.dist <- 2
reduce(integer(length(x)-1), ~c(.x, sample(x[!x %in% tail(.x, min.dist)], 1)), .init=sample(x,1))
[1] "A" "E" "D" "B" "A" "D" "E" "C" "D" "A" "C" "E" "B" "A" "E"

Wrapped in a function:

shuffle <- function(x, min.dist=2){
  stopifnot(min.dist < length(unique(x)))
  reduce(integer(length(x)-1), ~c(.x, sample(x[!x %in% tail(.x, min.dist)], 1)), .init=sample(x,1))
}
> shuffle(x, 3)
[1] "A" "C" "B" "D" "E" "A" "B" "C" "E" "D" "A" "B" "C" "E" "A"
> shuffle(x, 3)
[1] "A" "B" "D" "E" "C" "A" "B" "D" "E" "C" "A" "D" "E" "C" "A"
> shuffle(x, 4)
[1] "C" "E" "D" "A" "B" "C" "E" "D" "A" "B" "C" "E" "D" "A" "B"
> shuffle(x, 4)
[1] "A" "B" "D" "E" "C" "A" "B" "D" "E" "C" "A" "B" "D" "E" "C"
> shuffle(x, 2)
[1] "E" "A" "D" "E" "B" "D" "A" "E" "C" "D" "A" "E" "C" "A" "B"
> shuffle(x, 2)
[1] "B" "A" "D" "C" "B" "A" "E" "B" "A" "E" "B" "C" "D" "A" "E"
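
For readers less familiar with reduce(), the same idea can be read as an ordinary loop. The following is a base R sketch, not the answer's code: shuffle_loop is a hypothetical name, and it picks by index with sample.int() instead of calling sample() on the candidates. Like the one-liner, it draws from the full x at every step, so the number of times each value appears in the result is not forced to match the input.

```r
# Base R sketch of the same idea as the reduce() call
# (shuffle_loop is a hypothetical name, not from the original answer).
shuffle_loop <- function(x, min.dist = 2) {
  stopifnot(min.dist < length(unique(x)))
  out <- sample(x, 1)                                # random first element
  for (k in seq_len(length(x) - 1)) {
    # values that do not occur among the last min.dist picks
    candidates <- x[!x %in% tail(out, min.dist)]
    out <- c(out, candidates[sample.int(length(candidates), 1)])
  }
  out
}

set.seed(1)
res <- shuffle_loop(rep(LETTERS[1:5], 3), 2)
length(res)   # 15, the same length as the input
table(res)    # counts per letter are not guaranteed to be 3 each
```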

After the comment by @27⏴9:

shuffle <- function(x, min.dist=2){
  stopifnot(min.dist < length(unique(x)))
  reduce(integer(length(x)-1), ~ c(.x, sample(x[!x %in% tail(.x, min.dist) &( x %in% names(t <- table(x[x%in%.x]) > table(.x))[t] | !x %in% .x)], 1)), .init=sample(x,1))
}
> table(shuffle(rep(LETTERS[1:5], 3),2))
A B C D E 
3 3 3 3 3 
> table(shuffle(rep(LETTERS[1:5], 3),2))
Error in sample.int(length(x), size, replace, prob) : 
invalid first argument
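
The error is consistent with the candidate set becoming empty at some step: once every value still allowed by the frequency condition also sits in the last min.dist positions, the subset passed to sample() has length zero. This is a reading of the output above, not something the answer states explicitly. A minimal base R sketch of that failure mode:

```r
candidates <- character(0)   # nothing left that satisfies both conditions

# sample() cannot draw one element from an empty vector, so it raises an error;
# tryCatch() turns that error into a plain string for demonstration.
outcome <- tryCatch(
  sample(candidates, 1),
  error = function(e) "error"
)
outcome
# [1] "error"
```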

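Update

After some trial and error, and given that there are not always enough elements left to keep values min.dist apart, I came up with the following solution. It is the most thoroughly commented of the versions above: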

shuffle <- function(x, min.dist=2){
  stopifnot(min.dist < length(unique(x)))
  reduce(integer(length(x)-1), function(.x, ...){
    # whether the value is in the tail of the aggregated vector
    in.tail <- x %in% tail(.x, min.dist)
    # whether a value still hasn't reached the max frequency
    freq.got <- x %in% names(t<-table(x[x%in%.x]) > table(.x))[t]
    # whether a value isn't in the aggregated vector
    yet <- !x %in% .x
    # the if is there basically to account for the cases when we don't have enough vars to space out the vectors
    c(.x, if(any((!in.tail & freq.got) | yet )) sample(x[(!in.tail & freq.got) | yet ], 1) else  x[which(freq.got)[1]] )
  }, .init=sample(x,1))
}

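Now running table(shuffle(rep(LETTERS[1:5], 3),2)) will always return 3 for every value, and we can say with reasonable confidence that identical values in the vector are spaced with a minimum distance of 2. The only way to guarantee that no element ends up too close to its duplicate is to use min.dist=length(unique(x))-1. Otherwise there can be cases where at most r < min.dist elements are not min.dist away from their last occurrence, and if such elements exist, they will be in the length(x) + 1 - 1:min.dist subset of the resulting vector, i.e. its last min.dist positions.

Just to be completely sure, use a loop to check whether the tail of the output vector has unique values (remove the print statement; I only use it for demonstration purposes):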

shuffler <- function(x, min.dist=2){
  while(!length(unique(print(tail(l<-shuffle(x, min.dist=min.dist), min.dist+1))))==min.dist+1){}
  l
}
table(print(shuffler(rep(LETTERS[1:5], 3),2)))
[1] "A" "B" "C" "E" "B" "C" "D" "A" "C" "D" "A" "E" "B" "D" "E"
A B C D E 
3 3 3 3 3 
table(print(shuffler(rep(LETTERS[1:5], 3),2)))
[1] "D" "C" "C"
[1] "C" "C" "E"
[1] "C" "A" "C"
[1] "D" "B" "D"
[1] "B" "E" "D"
[1] "C" "A" "E" "D" "A" "B" "C" "E" "A" "B" "D" "C" "B" "E" "D"
A B C D E 
3 3 3 3 3 

Update:

shuffler <- function(x, min.dist=2){
  while(any(unlist(lapply(unique(tl<-tail(l<-shuffle(x, min.dist=min.dist), 2*min.dist)), function(x) diff(which(tl==x))<=min.dist)))){}
  l
}

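This new version performs a stricter test of whether the elements in the tail of the vector are min.dist apart. The previous version works for min.dist=2, but this new one does better.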
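
The condition inside the while loop is dense, so here it is spelled out as a separate predicate. This is a sketch for reading purposes only: tail_is_spaced is a hypothetical name, and the function returns TRUE exactly when the loop above would stop retrying.

```r
# Hypothetical helper mirroring the while() condition above:
# TRUE if, within the last 2*min.dist elements of v, no value
# repeats at an index distance of min.dist or less.
tail_is_spaced <- function(v, min.dist = 2) {
  tl <- tail(v, 2 * min.dist)
  for (val in unique(tl)) {
    # distances between consecutive occurrences of val inside the tail
    if (any(diff(which(tl == val)) <= min.dist)) return(FALSE)
  }
  TRUE
}

tail_is_spaced(c("A", "B", "C", "A"), 2)
# [1] TRUE    ("A" repeats at distance 3)
tail_is_spaced(c("A", "B", "A", "C"), 2)
# [1] FALSE   ("A" repeats at distance 2)
```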

If your data is large, relying on probability may be much faster for this kind of task.

Here is an example:

prob_shuffler = function(x, min.dist = 2){
  n = length(x)
  res = sample(x)
  OK = FALSE

  # We loop until we have a solution
  while(!OK){
    OK = TRUE
    for(i in 1:min.dist){
      # We check if identical elements are 'i' steps away
      pblm = res[1:(n-i)] == res[-(1:i)]
      if(any(pblm)){
        if(sum(pblm) >= (n - i)/2){
          # back to square 1
          res = sample(x)
        } else {
          # we pair each identical element with
          # an extra one
          extra = sample(which(!pblm), sum(pblm))
          id_reshuffle = c(which(pblm), extra)
          res[id_reshuffle] = sample(res[id_reshuffle])
        }
        # We recheck from the beginning
        OK = FALSE
        break
      }
    }
  }
  res
}

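Although the while loop looks scary, in practice it converges quickly. Of course, the lower the probability of two identical characters being within min.dist of each other, the faster it converges.

The current solutions by Abdessabour Mtk and Carles Sans Fuentes work but, depending on the size of the input data, they quickly become prohibitively slow. Here is a benchmark: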
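
The core of the check is the vectorized comparison res[1:(n-i)] == res[-(1:i)], which compares every element with the one i positions to its right. A small illustration on a made-up vector (the data here is chosen for the example and is not from the answer):

```r
res <- c("A", "B", "A", "C", "C")
n <- length(res)

# i = 1: compare each element with its right-hand neighbour
i <- 1
lag1 <- res[1:(n - i)] == res[-(1:i)]
lag1
# [1] FALSE FALSE FALSE  TRUE

# i = 2: compare each element with the one two positions further on
i <- 2
lag2 <- res[1:(n - i)] == res[-(1:i)]
lag2
# [1]  TRUE FALSE FALSE
```

Each TRUE marks the left-hand position of a pair that is too close. Those are the positions prob_shuffler collects with which(pblm) and reshuffles together with an equal number of other positions.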

library(microbenchmark)
x = rep(c(letters, LETTERS), 10)
length(x)
#> [1] 520
microbenchmark(prob_shuffler(x, 1), shuffler_am(x, 1), shuffler_csf(x, 1), times = 10)
#> Unit: microseconds
#>                 expr       min        lq       mean    median        uq        max neval
#>  prob_shuffler(x, 1)    87.001   111.501    155.071   131.801   192.401    264.401    10
#>    shuffler_am(x, 1) 17218.100 18041.900  20324.301 18740.351 22296.301  26495.200    10
#>   shuffler_csf(x, 1) 86771.401 88550.501 118185.581 95582.001 98781.601 341826.701    10
microbenchmark(prob_shuffler(x, 2), shuffler_am(x, 2), shuffler_csf(x, 2), times = 10)
#> Unit: microseconds
#>                 expr     min        lq       mean    median        uq        max neval
#>  prob_shuffler(x, 2)   140.1   195.201   236.3312   245.252   263.202    354.101    10
#>    shuffler_am(x, 2) 18886.2 19526.901 22967.6409 21021.151 26758.800  29133.400    10
#>   shuffler_csf(x, 2) 86078.1 92209.901 97151.0609 97612.251 99850.101 107981.401    10
microbenchmark(prob_shuffler(x, 3), shuffler_am(x, 3), shuffler_csf(x, 3), times = 10)
#> Unit: microseconds
#>                 expr       min        lq        mean     median       uq        max neval
#>  prob_shuffler(x, 3)   318.001   450.402    631.5312    573.352    782.2   1070.401    10
#>    shuffler_am(x, 3) 19003.501 19622.300  23314.4808  20784.551  28281.5  32885.101    10
#>   shuffler_csf(x, 3) 87692.701 96152.202 101233.5411 100925.201 108034.7 113814.901    10

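We can notice two things: a) as one would expect, the speed of prob_shuffler depends on min.dist, while the other methods are much less affected by it; b) with just 520 observations, prob_shuffler is already about 100 times faster (and it scales).

Of course, if the probability of two identical characters landing within min.dist of each other is very high, the recursive methods should be faster. But in most practical cases, the probabilistic method is faster.

I hope this answer works well for you. It is written in base R, but it works. I left the print statements in (commented out) in case you want to check it line by line: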

x <- rep(LETTERS[1:5], 3)  # Create example vector

shuffle <- function(x, min_dist=3){
  #init variables
  result<-c() # result vector
  count<-0
  vec_use<-x
  vec_keep<-c()
  for(i in 1:length(x)){
    #    print(paste0("iteration =", i))
    if (count>min_dist){
      valback<-vec_keep[1]
      #      print(paste0("value to be returned:",  valback))
      ntimes_valback<-(table(vec_keep)[valback])
      vec_use<- c(vec_use,rep(valback,ntimes_valback))
      #      print(paste0("vec_use after giving back valbak =", valback))
      #      print(paste0(vec_use,","))
      vec_keep <- vec_keep[!vec_keep %in% valback]
      #      print(paste0("vec_keep after removing valback =", valback))
      #      print(paste0(vec_keep,","))
    }
    val<-sample(vec_use,1)
    #    print(paste0("val = ",val))#remove value
    vec_keep<- c(vec_keep,x[x %in% val])
    vec_keep<-vec_keep[1:(length(vec_keep)-1)]#removing 1 letter
    #    print(paste0("vec_keep ="))
    #    print(paste0(vec_keep,","))
    vec_use <- vec_use[!vec_use %in% val]
    #    print(paste0("vec_use ="))
    #    print(paste0(vec_use,","))
    result[i]<-val
    count<-count+1
  }
  return(result)
}
shuffle(x)
"C" "D" "B" "E" "C" "A" "B" "D" "E" "A" "C" "D" "B" "E" "C"
