Round to the nearest arbitrary number from a list



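I'm basically looking for a way to do a variation of a Ruby script in R.
I have an arbitrary list of numbers (moderator steps for a regression plot, in this case) that are unequally spaced, and I want to round values that fall within a range around these numbers to the nearest number in the list. The ranges don't overlap.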

arbitrary.numbers <- c(4,10,15) / 10
numbers <- c(16:1 / 10, 0.39, 1.45)
range <- 0.1

Expected output:

numbers
## 1.6 1.5 1.4 1.3 1.2 1.1 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.39 1.45
round_to_nearest_neighbour_in_range(numbers,arbitrary.numbers,range)
## 1.5 1.5 1.5 1.3 1.2 1.0 1.0 1.0 0.8 0.7 0.6 0.4 0.4 0.4 0.2 0.1 0.4 1.5

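I have a small helper function that solves my specific problem, but it isn't very flexible and it contains a loop. I could post it here, but I think a real solution would look completely different.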

The different answers, timed for speed (on a million numbers):

> numbers = rep(numbers,length.out = 1000000)
> system.time({ mvg.round(numbers,arbitrary.numbers,range) })[3]
elapsed 
0.067 
> system.time({ rinker.loop.round(numbers,arbitrary.numbers,range) })[3]
elapsed 
0.289 
> system.time({ rinker.round(numbers,arbitrary.numbers,range) })[3]
elapsed 
1.403 
> system.time({ nograpes.round(numbers,arbitrary.numbers,range) })[3]
elapsed 
1.971 
> system.time({ january.round(numbers,arbitrary.numbers,range) })[3]
elapsed 
16.12 
> system.time({ shariff.round(numbers,arbitrary.numbers,range) })[3]
elapsed 
15.833 
> system.time({ mplourde.round(numbers,arbitrary.numbers,range) })[3]
elapsed 
9.613 
> system.time({ kohske.round(numbers,arbitrary.numbers,range) })[3]
elapsed 
26.274 

MvG's function is the fastest, roughly four times faster than Tyler Rinker's second function (0.067 s versus 0.289 s elapsed).

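A vectorized solution, without any apply-family functions or loops:

The key is findInterval, which finds the "gap" in arbitrary.numbers that each element of numbers falls into. So findInterval(6,c(2,4,7,8)) returns 2, because 6 lies between the second and third elements of c(2,4,7,8).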
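To make the index semantics concrete, here is a small sketch of what findInterval returns for values inside, on, and outside the breakpoints. The variable names `breaks` and `idx` are only illustrative.

```r
breaks <- c(2, 4, 7, 8)   # must be sorted in increasing order

# 6 lies between breaks[2] = 4 and breaks[3] = 7, so its index is 2.
# 4 sits exactly on a breakpoint; intervals are closed on the left, so also 2.
# 1 is below the first breakpoint (index 0), 9 is above the last (index 4).
idx <- findInterval(c(6, 4, 1, 9), breaks)
print(idx)   # 2 2 0 4
```

The indices 0 and length(breaks) are the edge cases that any lookup into the breakpoint vector has to handle.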

# arbitrary.numbers is assumed to be sorted.
# find the index of the number just below each number, and just above.
# So for 6 in c(2,4,7,8) we would find 2 and 3.
low<-findInterval(numbers,arbitrary.numbers) # find index of number just below
high<-low+1 # find the corresponding index just above.
# Find the actual absolute difference between the arbitrary number above and below.
# So for 6 in c(2,4,7,8) we would find 2 and 1. 
# (The absolute differences to 4 and 7).
low.diff<-numbers-arbitrary.numbers[ifelse(low==0,NA,low)]
high.diff<-arbitrary.numbers[ifelse(high==0,NA,high)]-numbers
# Find the minimum difference. 
# In the example we would find that 6 is closest to 7, 
# because the difference is 1.
mins<-pmin(low.diff,high.diff,na.rm=TRUE)
# For each number, pick the arbitrary number with the minimum difference.
# So for 6 pick out 7.
pick<-ifelse(!is.na(low.diff) & mins==low.diff,low,high)
# Compare the actual minimum difference to the range. 
ifelse(mins<=range+.Machine$double.eps,arbitrary.numbers[pick],numbers)
# [1] 1.5 1.5 1.5 1.3 1.2 1.0 1.0 1.0 0.8 0.7 0.6 0.4 0.4 0.4 0.2 0.1 0.4 1.5
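
The comments in the snippet above follow the value 6 through c(2,4,7,8). As a sketch, the same steps can be run on just that example. The range of 1 is an assumption chosen here so that 6 actually snaps to 7.

```r
arbitrary.numbers <- c(2, 4, 7, 8)   # sorted, as the snippet assumes
numbers <- 6
range <- 1                           # assumed for this illustration

low <- findInterval(numbers, arbitrary.numbers)   # 2: index of the value just below
high <- low + 1                                   # 3: index of the value just above
low.diff <- numbers - arbitrary.numbers[low]      # 6 - 4 = 2
high.diff <- arbitrary.numbers[high] - numbers    # 7 - 6 = 1
mins <- pmin(low.diff, high.diff, na.rm = TRUE)   # 1: the smaller distance
pick <- ifelse(!is.na(low.diff) & mins == low.diff, low, high)  # 3
result <- ifelse(mins <= range, arbitrary.numbers[pick], numbers)
print(result)   # 7
```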

Another solution using findInterval:

arbitrary.numbers<-sort(arbitrary.numbers)          # need them sorted
range <- range*1.000001                             # avoid rounding issues
nearest <- findInterval(numbers, arbitrary.numbers - range) # index of nearest
nearest <- c(-Inf, arbitrary.numbers)[nearest + 1]  # value of nearest
diff <- numbers - nearest                           # compute errors
snap <- diff <= range                               # only snap near numbers
numbers[snap] <- nearest[snap]                      # snap values to nearest
print(numbers)

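The nearest in the code above is not mathematically the nearest number. Instead, it is the largest arbitrary number such that nearest[i] - range <= numbers[i], or equivalently nearest[i] <= numbers[i] + range. So in one step we find the largest arbitrary number that is either within snapping range of the given input number or still too small for that. This is why snap only needs to be checked in one direction. No absolute value is required, and the squaring used in an earlier revision of this post was unnecessary as well.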
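
As a sketch of why the one-sided check is enough, the following traces a few inputs through the same steps. The inputs in `x` are assumptions picked for illustration, and `err` stands in for the `diff` variable of the snippet.

```r
arbitrary.numbers <- c(0.4, 1.0, 1.5)
range <- 0.1 * 1.000001
x <- c(0.2, 0.3, 0.7, 0.95)   # illustrative inputs (assumed)

idx <- findInterval(x, arbitrary.numbers - range)
nearest <- c(-Inf, arbitrary.numbers)[idx + 1]
err <- x - nearest
snap <- err <= range

# 0.2:  no candidate at all (-Inf), err is Inf, so no snap
# 0.3:  candidate 0.4 lies above the input, err is negative, so it snaps
# 0.7:  candidate 0.4 is too far below, err is 0.3, so no snap
# 0.95: candidate 1.0 lies above the input, err is negative, so it snaps
print(data.frame(x, nearest, err, snap))
```

A candidate above the input always yields a negative error and therefore always passes the test, which is fine because findInterval only selects it when it is within range.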

Thanks to the question "Interval search on a data frame" for the pointer to findInterval; I found it there before noticing it in the answer by nograpes.

If, contrary to your original question, you had overlapping ranges, you could write it like this:

arbitrary.numbers<-sort(arbitrary.numbers)        # need them sorted
range <- range*1.000001                           # avoid rounding issues
nearest <- findInterval(numbers, arbitrary.numbers) + 1 # index of interval
hi <- c(arbitrary.numbers, Inf)[nearest]          # next larger
nearest <- c(-Inf, arbitrary.numbers)[nearest]    # next smaller
takehi <- (hi - numbers) < (numbers - nearest)    # larger better than smaller
nearest[takehi] <- hi[takehi]                     # now nearest is really nearest
snap <- abs(nearest - numbers) <= range           # only snap near numbers
numbers[snap] <- nearest[snap]                    # snap values to nearest
print(numbers)

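In this code, nearest really is the nearest number. This is achieved by considering both endpoints of each interval. In spirit it is very similar to the version by nograpes, but it avoids ifelse and NA, which should benefit performance because it reduces the number of branching instructions.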
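
To illustrate where the two variants differ, the sketch below wraps each snippet in a function and feeds both a case with overlapping ranges. The function names `snap_one_sided` and `snap_two_sided` and the test values are assumptions made for this example.

```r
# Hypothetical wrappers; the bodies follow the two snippets above.
snap_one_sided <- function(numbers, arbitrary.numbers, range) {
  arbitrary.numbers <- sort(arbitrary.numbers)
  range <- range * 1.000001
  idx <- findInterval(numbers, arbitrary.numbers - range)
  nearest <- c(-Inf, arbitrary.numbers)[idx + 1]
  snap <- numbers - nearest <= range
  numbers[snap] <- nearest[snap]
  numbers
}

snap_two_sided <- function(numbers, arbitrary.numbers, range) {
  arbitrary.numbers <- sort(arbitrary.numbers)
  range <- range * 1.000001
  i <- findInterval(numbers, arbitrary.numbers) + 1
  hi <- c(arbitrary.numbers, Inf)[i]             # next larger target
  nearest <- c(-Inf, arbitrary.numbers)[i]       # next smaller target
  takehi <- (hi - numbers) < (numbers - nearest)
  nearest[takehi] <- hi[takehi]
  snap <- abs(nearest - numbers) <= range
  numbers[snap] <- nearest[snap]
  numbers
}

close.targets <- c(0.4, 0.5)   # with range 0.1 the two ranges overlap
one <- snap_one_sided(0.42, close.targets, 0.1)
two <- snap_two_sided(0.42, close.targets, 0.1)
print(one)   # 0.5: the largest target within reach, not the closest one
print(two)   # 0.4: the closest target
```

With non-overlapping ranges, as in the original question, at most one target is within reach of any input, so both variants return the same result.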

Is this what you want?

> idx <- abs(outer(arbitrary.numbers, numbers, `-`)) <= (range+.Machine$double.eps)
> rounded <- arbitrary.numbers[apply(rbind(idx, colSums(idx) == 0), 2, which)]
> ifelse(is.na(rounded), numbers, rounded)
[1] 1.5 1.5 1.5 1.3 1.2 1.0 1.0 1.0 0.8 0.7 0.6 0.4 0.4 0.4 0.2 0.1 0.4 1.5

Note that, most likely because of rounding errors, I had to use range = 0.1000001 to get the expected result.
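
The rounding issue can be checked directly. As a small sketch using values from the question, the difference between 1.6 and 1.5 comes out slightly larger than 0.1 in binary floating point, so a strict comparison against the range fails unless a small tolerance is added. The variable names are only illustrative.

```r
range <- 0.1

strict   <- (1.6 - 1.5) <= range                        # FALSE
with.eps <- (1.6 - 1.5) <= range + .Machine$double.eps  # TRUE
with.tol <- (1.6 - 1.5) <= range + 0.0000001            # TRUE

print(c(strict = strict, with.eps = with.eps, with.tol = with.tol))
print(1.6 - 1.5, digits = 17)   # shows the excess over 0.1
```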

range <- range + 0.0000001
blah <- rbind( numbers, sapply( numbers, function( x ) abs( x - arbitrary.numbers ) ) )
ff <- function( y ) { if( min( y[-1] ) <= range + 0.000001 ) arbitrary.numbers[ which.min( y[ -1 ] ) ] else  y[1]  }
apply( blah, 2, ff )

This is shorter still:

sapply(numbers, function(x) ifelse(
  min(abs(arbitrary.numbers - x)) > range + .Machine$double.eps,
  x, arbitrary.numbers[which.min(abs(arbitrary.numbers - x))]))

Thanks to @MvG

Another option:

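arb.round <- function(numbers, arbitrary.numbers, range) {
  # Snap every element of x that lies within r of the single target ns.
  arrnd <- function(x, ns, r) {
    ifelse(abs(x - ns) <= r + .00000001, ns, x)
  }
  # `<<-` updates `numbers` in the enclosing arb.round() frame on each pass.
  lapply(seq_along(arbitrary.numbers), function(i) {
    numbers <<- arrnd(numbers, arbitrary.numbers[i], range)
  })
  numbers
}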
arb.round(numbers, arbitrary.numbers, range)

Yields:

> arb.round(numbers, arbitrary.numbers, range)
[1] 1.5 1.5 1.5 1.3 1.2 1.0 1.0 1.0 0.8 0.7 0.6 0.4 0.4 0.4 0.2 0.1 0.4 1.5

EDIT: I removed the return call at the end of the function, since it wasn't necessary and may cost time.

EDIT: I think a loop will be faster here:

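loop.round <- function(numbers, arbitrary.numbers, range) {
  # Snap every element of x that lies within r of the single target ns.
  arrnd <- function(x, ns, r) {
    ifelse(abs(x - ns) <= r + .00000001, ns, x)
  }
  # One vectorized pass over `numbers` per target value.
  for (i in seq_along(arbitrary.numbers)) {
    numbers <- arrnd(numbers, arbitrary.numbers[i], range)
  }
  numbers
}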
