我有一个1,000,000行的数据帧。我想计算每个SID
的Tor
的均值和方差,看看我是否可以预测Tor
何时开始超出限制。下限为0.4,上限为0.7。下面是我的数据的一个小例子。
dat <- structure(list(timestamp = c("29-06-2021-06:00", "29-06-2021-06:01",
"29-06-2021-06:02", "29-06-2021-06:03", "29-06-2021-06:04", "29-06-2021-06:05",
"29-06-2021-06:06", "29-06-2021-06:07", "29-06-2021-06:08", "29-06-2021-06:09",
"29-06-2021-06:10", "29-06-2021-06:11", "29-06-2021-06:12", "29-06-2021-06:13",
"29-06-2021-06:14", "29-06-2021-06:15", "29-06-2021-06:16", "29-06-2021-06:17",
"29-06-2021-06:18", "29-06-2021-06:19", "29-06-2021-06:20", "29-06-2021-06:21",
"29-06-2021-06:22", "29-06-2021-06:23", "29-06-2021-06:24", "29-06-2021-06:25",
"29-06-2021-06:26"), SID = c(301L, 351L, 304L, 357L, 358L, 302L,
303L, 309L, 356L, 304L, 308L, 351L, 304L, 357L, 358L, 302L, 303L,
352L, 307L, 353L, 304L, 308L, 352L, 307L, 304L, 354L, 356L),
Tor = c(0.70161919, 0.639416295, 0.288282073, 0.932362166,
0.368616626, 0.42175565, 0.409735918, 0.942170196, 0.381396521,
0.818102394, 0.659391671, 0.246387978, 0.196001777, 0.632630259,
0.66618385, 0.440625167, 0.639759498, 0.050001835, 0.775660271,
0.762934189, 0.516830196, 0.244674975, 0.38620466, 0.970792903,
0.752674581, 0.190366737, 0.56596405), Lowt = c(0L, 0L, 1L,
0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 1L,
0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L), Hit = c(1L, 0L, 0L,
1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-27L))
head(dat)
# timestamp SID Tor Lowt Hit
#1 29-06-2021-06:00 301 0.7016192 0 1
#2 29-06-2021-06:01 351 0.6394163 0 0
#3 29-06-2021-06:02 304 0.2882821 1 0
#4 29-06-2021-06:03 357 0.9323622 0 1
#5 29-06-2021-06:04 358 0.3686166 1 0
#6 29-06-2021-06:05 302 0.4217556 0 0
Timestamp
为记录样品时SID
为读取部件的ID。这些值可以是301 ~ 310和351 ~ 360Tor
为实际读数,其数据类型为<dbl>
。Lowt
为二进制变量,表示Tor
读数低于下限值。Hit
为二进制变量,表示Tor
读数低于上限。
我读过关于方差的书,但我似乎无法理解它。如果有任何帮助就太好了。
这是一个非常好的问题。你想计算累积平均值和累积方差随着时间的推移,Tor
/SID
考虑到实际数据集的容量,使用在线算法是合适的。。关于算法细节,请参阅我和本杰明在2018年对这个话题的回答。总之,我的贡献是:
cummean <- function (x) cumsum(x) / seq_along(x)
cumvar <- function (x, sd = FALSE) {
x <- x - x[sample.int(length(x), 1)]
n <- seq_along(x)
v <- (cumsum(x ^ 2) - cumsum(x) ^ 2 / n) / (n - 1)
if (sd) v <- sqrt(v)
v
}
这里需要做的额外工作是为每个SID应用这些函数。
## sort data entries
sorted_dat <- dat[order(dat$SID, dat$timestamp), ]
## split Tor by SID
lst <- split(sorted_dat$Tor, sorted_dat$SID)
## apply cummean() and cumvar()
runmean <- unlist(lapply(lst, cummean), use.names = FALSE)
runvar <- unlist(lapply(lst, cumvar), use.names = FALSE)
## add back
sorted_dat$runmean <- runmean
sorted_dat$runvar <- runvar
结果如下。不要对方差中的NaN感到惊讶。每个SID中的第一个值始终是NaN。这是正常的(我们只能在有2+数据时计算方差)。
## inspection
sorted_dat
# timestamp SID Tor Lowt Hit runmean runvar
#1 29-06-2021-06:00 301 0.70161919 0 1 0.70161919 NaN
#6 29-06-2021-06:05 302 0.42175565 0 0 0.42175565 NaN
#16 29-06-2021-06:15 302 0.44062517 0 0 0.43119041 0.0001780293
#7 29-06-2021-06:06 303 0.40973592 1 0 0.40973592 NaN
#17 29-06-2021-06:16 303 0.63975950 0 0 0.52474771 0.0264554237
#3 29-06-2021-06:02 304 0.28828207 1 0 0.28828207 NaN
#10 29-06-2021-06:09 304 0.81810239 0 1 0.55319223 0.1403547863
#13 29-06-2021-06:12 304 0.19600178 1 0 0.43412875 0.1127057339
#21 29-06-2021-06:20 304 0.51683020 0 0 0.45480411 0.0768470383
#25 29-06-2021-06:24 304 0.75267458 0 1 0.51437820 0.0753806422
#19 29-06-2021-06:18 307 0.77566027 0 1 0.77566027 NaN
#24 29-06-2021-06:23 307 0.97079290 0 1 0.87322659 0.0190383720
#11 29-06-2021-06:10 308 0.65939167 0 0 0.65939167 NaN
#22 29-06-2021-06:21 308 0.24467497 1 0 0.45203332 0.0859949690
#8 29-06-2021-06:07 309 0.94217020 0 1 0.94217020 NaN
#2 29-06-2021-06:01 351 0.63941629 0 0 0.63941629 NaN
#12 29-06-2021-06:11 351 0.24638798 1 0 0.44290214 0.0772356290
#18 29-06-2021-06:17 352 0.05000184 1 0 0.05000184 NaN
#23 29-06-2021-06:22 352 0.38620466 1 0 0.21810325 0.0565161698
#20 29-06-2021-06:19 353 0.76293419 0 1 0.76293419 NaN
#26 29-06-2021-06:25 354 0.19036674 1 0 0.19036674 NaN
#9 29-06-2021-06:08 356 0.38139652 1 0 0.38139652 NaN
#27 29-06-2021-06:26 356 0.56596405 0 0 0.47368029 0.0170325864
#4 29-06-2021-06:03 357 0.93236217 0 1 0.93236217 NaN
#14 29-06-2021-06:13 357 0.63263026 0 0 0.78249621 0.0449196080
#5 29-06-2021-06:04 358 0.36861663 1 0 0.36861663 NaN
#15 29-06-2021-06:14 358 0.66618385 0 0 0.51740024 0.0442731264