R data.table: when a value exceeds its column threshold, carry the remainder over to the next column (large dataset)



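I'm working on a simple peak-shaving algorithm and am looking for the most efficient way to move the remainder of a column value into the next column whenever the value exceeds a threshold.

Given the sample dataset below, where each column has its own threshold, the goal is a data.table in which every value is capped at its threshold and the excess is added to the next column's value (which in turn must not exceed its own threshold), and so on, up to a given window limit.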

library(data.table)

loads <- data.table(index = 1:3,
                    time1 = c(6600, 3000, 12000),
                    time2 = c(12000, 4000, 2000),
                    time3 = c(0, 0, 0),
                    time4 = c(3000, 12000, 0),
                    time5 = c(5000, 2000, 3000),
                    time6 = c(0, 0, 0),
                    time7 = c(15000, 0, 0))
thresholds <- c("time1" = 5000,
                "time2" = 5000,
                "time3" = 5000,
                "time4" = 12000,
                "time5" = 12000,
                "time6" = 12000,
                "time7" = 5000)

For a window of 7 columns, this should produce the following data.table:

res <- data.table(index = 1:3,
                  time1 = c(5000, 3000, 5000),
                  time2 = c(5000, 4000, 5000),
                  time3 = c(5000, 0, 4000),
                  time4 = c(6600, 12000, 0),
                  time5 = c(5000, 2000, 3000),
                  time6 = c(0, 0, 0),
                  time7 = c(5000, 0, 0))

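I know there are some obvious ways to do this, but I'm looking for a more vectorized, data.table-idiomatic way to achieve it.

I don't think this is easy (or perhaps even possible?) in "just" vectorized, data.table-canonical code, but here is a straightforward for loop that performs (I think) about as efficiently as data.table allows, which seems reasonable.

Up front: I'm adding timeX to thresholds (with a limit of Inf) and to loads (with a value of 0) as a catch-all column, so that we know how much of each row's remainder is "lost". It also makes the for loop more convenient, although the loop can be written without it.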

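library(data.table)

# Add a catch-all column: an unlimited threshold plus a zero-filled load column.
thresholds <- c("time1" = 5000,
                "time2" = 5000,
                "time3" = 5000,
                "time4" = 12000,
                "time5" = 12000,
                "time6" = 12000,
                "time7" = 5000,
                "timeX" = Inf)
loads[, timeX := 0]

for (ind in seq_along(thresholds)) {
  # timeX has no next column to spill into, so stop before processing it
  if (ind >= length(thresholds)) break
  nm  <- names(thresholds)[ind]      # current column
  nm1 <- names(thresholds)[ind + 1]  # next column, which receives the excess
  # amount by which each row exceeds the current column's threshold
  rmndr <- pmax(0, loads[[nm]] - thresholds[ind])
  # cap the current column, then add the excess to the next one (both by reference)
  set(loads, i = NULL, j = nm, value = pmin(loads[[nm]], thresholds[ind]))
  set(loads, i = NULL, j = nm1, value = loads[[nm1]] + rmndr)
}
loads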
#    index time1 time2 time3 time4 time5 time6 time7 timeX
#    <int> <num> <num> <num> <num> <num> <num> <num> <num>
# 1:     1  5000  5000  5000  6600  5000     0  5000 10000
# 2:     2  3000  4000     0 12000  2000     0     0     0
# 3:     3  5000  5000  4000     0  3000     0     0     0
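
To see where the 10000 in timeX for row 1 comes from, the same cap-and-carry step can be traced on a single row held in plain numeric vectors. This is only an illustrative sketch: the names vals, lims, carry and capped are introduced here and are not part of the code above.

```r
# Row 1 of `loads` and the thresholds, as plain named numeric vectors
# (illustrative names, not used elsewhere in this post).
vals <- c(time1 = 6600, time2 = 12000, time3 = 0, time4 = 3000,
          time5 = 5000, time6 = 0, time7 = 15000)
lims <- c(time1 = 5000, time2 = 5000, time3 = 5000, time4 = 12000,
          time5 = 12000, time6 = 12000, time7 = 5000)

carry  <- 0                                   # excess coming from the previous column
capped <- setNames(numeric(length(vals)), names(vals))

for (nm in names(vals)) {
  total        <- vals[[nm]] + carry          # own value plus incoming excess
  capped[[nm]] <- min(total, lims[[nm]])      # never exceed this column's threshold
  carry        <- total - capped[[nm]]        # excess handed to the next column
}

print(capped)  # the capped values for time1..time7
print(carry)   # whatever is still left after time7, i.e. the timeX value
```

The excess from time1 through time3 is absorbed by time4, which ends at 6600, so nothing is carried into time5. Only the 10000 clipped from time7 has no column left to go to.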

Alternatively, if you don't actually care about the discarded amount, then:

## using unmodified `loads` and `thresholds`
for (ind in seq_along(thresholds)) {
  nm <- names(thresholds)[ind]
  rmndr <- pmax(0, loads[[nm]] - thresholds[nm])
  # cap the current column
  set(loads, i = NULL, j = nm, value = pmin(loads[[nm]], thresholds[nm]))
  # the last column has no successor, so its excess is simply dropped
  if (ind == length(thresholds)) break
  nm1 <- names(thresholds)[ind + 1]
  set(loads, i = NULL, j = nm1, value = loads[[nm1]] + rmndr)
}
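
If this is needed in more than one place, the loop could be wrapped in a small helper. The sketch below is one possible arrangement, not a data.table feature: the function name cap_and_carry and its list return value are made up for illustration. It assumes the threshold names match numeric columns of the table, works on a copy so the input is left untouched, and keeps the running excess in a local vector instead of a timeX column.

```r
library(data.table)

# Hypothetical helper (not part of data.table): caps each column named in
# `thresholds` and carries the per-row excess into the next named column.
cap_and_carry <- function(dt, thresholds) {
  out   <- copy(dt)                 # work on a copy; `dt` is not modified
  carry <- numeric(nrow(out))       # per-row excess from the previous column
  for (nm in names(thresholds)) {
    total  <- out[[nm]] + carry     # own value plus incoming excess
    capped <- pmin(total, thresholds[[nm]])
    carry  <- total - capped        # excess handed to the next column
    set(out, j = nm, value = capped)
  }
  # `lost` is the excess that remains after the last column
  list(capped = out, lost = carry)
}

# Sample data from the question
loads <- data.table(index = 1:3,
                    time1 = c(6600, 3000, 12000),
                    time2 = c(12000, 4000, 2000),
                    time3 = c(0, 0, 0),
                    time4 = c(3000, 12000, 0),
                    time5 = c(5000, 2000, 3000),
                    time6 = c(0, 0, 0),
                    time7 = c(15000, 0, 0))
thresholds <- c(time1 = 5000, time2 = 5000, time3 = 5000, time4 = 12000,
                time5 = 12000, time6 = 12000, time7 = 5000)

out <- cap_and_carry(loads, thresholds)
out$capped   # capped table, to compare against `res` from the question
out$lost     # per-row amount that could not be placed in any column
```

On the sample data, out$capped should match the expected res from the question, and out$lost plays the role of the timeX column. Copying the table costs memory on a very large dataset, so dropping the copy() call and modifying by reference, as the loops above do, may be preferable there.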
