R - Calculate the mean and variance over time for each group



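I have a data frame with 1,000,000 rows. I want to calculate the mean and variance of Tor for each SID, to see whether I can predict when Tor starts to go out of limits. The lower limit is 0.4 and the upper limit is 0.7. Below is a small example of my data.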

dat <- structure(list(timestamp = c("29-06-2021-06:00", "29-06-2021-06:01", 
"29-06-2021-06:02", "29-06-2021-06:03", "29-06-2021-06:04", "29-06-2021-06:05", 
"29-06-2021-06:06", "29-06-2021-06:07", "29-06-2021-06:08", "29-06-2021-06:09", 
"29-06-2021-06:10", "29-06-2021-06:11", "29-06-2021-06:12", "29-06-2021-06:13", 
"29-06-2021-06:14", "29-06-2021-06:15", "29-06-2021-06:16", "29-06-2021-06:17", 
"29-06-2021-06:18", "29-06-2021-06:19", "29-06-2021-06:20", "29-06-2021-06:21", 
"29-06-2021-06:22", "29-06-2021-06:23", "29-06-2021-06:24", "29-06-2021-06:25", 
"29-06-2021-06:26"), SID = c(301L, 351L, 304L, 357L, 358L, 302L, 
303L, 309L, 356L, 304L, 308L, 351L, 304L, 357L, 358L, 302L, 303L, 
352L, 307L, 353L, 304L, 308L, 352L, 307L, 304L, 354L, 356L), 
Tor = c(0.70161919, 0.639416295, 0.288282073, 0.932362166, 
0.368616626, 0.42175565, 0.409735918, 0.942170196, 0.381396521, 
0.818102394, 0.659391671, 0.246387978, 0.196001777, 0.632630259, 
0.66618385, 0.440625167, 0.639759498, 0.050001835, 0.775660271, 
0.762934189, 0.516830196, 0.244674975, 0.38620466, 0.970792903, 
0.752674581, 0.190366737, 0.56596405), Lowt = c(0L, 0L, 1L, 
0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 
0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L), Hit = c(1L, 0L, 0L, 
1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L)), class = "data.frame", row.names = c(NA, 
-27L))
head(dat)
#         timestamp SID       Tor Lowt Hit
#1 29-06-2021-06:00 301 0.7016192    0   1
#2 29-06-2021-06:01 351 0.6394163    0   0
#3 29-06-2021-06:02 304 0.2882821    1   0
#4 29-06-2021-06:03 357 0.9323622    0   1
#5 29-06-2021-06:04 358 0.3686166    1   0
#6 29-06-2021-06:05 302 0.4217556    0   0
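  • timestamp is the time at which the sample was recorded.

  • SID is the ID of the part that was read. Values range over 301 to 310 and 351 to 360.

  • Tor is the actual reading; its data type is <dbl>.

  • Lowt is a binary variable indicating that the Tor reading is below the lower limit.

  • Hit is a binary variable indicating that the Tor reading is above the upper limit.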

I have read about variance, but I can't seem to get my head around it. Any help would be great.

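This is a very good question. You want to compute the cumulative mean and cumulative variance of Tor over time for each SID. Given the size of your actual dataset, an online algorithm is appropriate. For the algorithmic details, see the answers that Benjamin and I gave on this topic in 2018. In short, my contribution is: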

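## running (cumulative) mean: the mean of x[1:k] for every k
cummean <- function (x) cumsum(x) / seq_along(x)
## running (cumulative) sample variance, or standard deviation if sd = TRUE
cumvar <- function (x, sd = FALSE) {
  ## shift by a randomly chosen element; the variance is unchanged by a shift,
  ## and smaller values help limit cancellation error in the sums below
  x <- x - x[sample.int(length(x), 1)]
  n <- seq_along(x)
  ## (sum of squares - (sum)^2 / n) / (n - 1), evaluated on every prefix
  v <- (cumsum(x ^ 2) - cumsum(x) ^ 2 / n) / (n - 1)
  if (sd) v <- sqrt(v)
  v
}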
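As a sanity check, the two functions can be compared against a brute-force recomputation of `mean()` and `var()` on every prefix of a vector. The sketch below repeats the definitions so that it runs on its own, and uses the five Tor readings of SID 304 from the example data; the names `brute_mean` and `brute_var` are only illustrative.

```r
## Same definitions as in the answer, repeated so this snippet is self-contained
cummean <- function (x) cumsum(x) / seq_along(x)
cumvar <- function (x, sd = FALSE) {
  x <- x - x[sample.int(length(x), 1)]
  n <- seq_along(x)
  v <- (cumsum(x ^ 2) - cumsum(x) ^ 2 / n) / (n - 1)
  if (sd) v <- sqrt(v)
  v
}

## Tor readings of SID 304 from the example data, in time order
x <- c(0.288282073, 0.818102394, 0.196001777, 0.516830196, 0.752674581)

## Brute force: recompute mean() and var() on every prefix x[1:k]
brute_mean <- sapply(seq_along(x), function (k) mean(x[1:k]))
brute_var <- sapply(seq_along(x), function (k) var(x[1:k]))

## Both comparisons should agree up to floating-point tolerance.
## The first variance is skipped: cumvar() gives NaN there, var() gives NA.
all.equal(cummean(x), brute_mean)
all.equal(cumvar(x)[-1], brute_var[-1])
```

The brute-force version does work proportional to k at step k, which is why the `cumsum()`-based formulation is preferable for a data frame with a million rows.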
The extra work needed here is to apply these functions to each SID.
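## sort data entries by SID, then chronologically
## (timestamps are "dd-mm-yyyy-HH:MM" strings, so parse them before ordering;
##  the UTC time zone is an assumption)
ts <- as.POSIXct(dat$timestamp, format = "%d-%m-%Y-%H:%M", tz = "UTC")
sorted_dat <- dat[order(dat$SID, ts), ]
## split Tor by SID
lst <- split(sorted_dat$Tor, sorted_dat$SID)
## apply cummean() and cumvar() within each SID
runmean <- unlist(lapply(lst, cummean), use.names = FALSE)
runvar <- unlist(lapply(lst, cumvar), use.names = FALSE)
## add back (groups come out in SID order, matching sorted_dat)
sorted_dat$runmean <- runmean
sorted_dat$runvar <- runvar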
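If the rows are already in chronological order, a possible alternative is base R's `ave()`, which applies a function within each group and returns the results in the original row order, so no split and re-join is needed. This is only a sketch, not part of the original answer: the data frame `toy` is a hypothetical stand-in built from a few rows of the example data.

```r
## Same definitions as in the answer, repeated so this snippet is self-contained
cummean <- function (x) cumsum(x) / seq_along(x)
cumvar <- function (x, sd = FALSE) {
  x <- x - x[sample.int(length(x), 1)]
  n <- seq_along(x)
  v <- (cumsum(x ^ 2) - cumsum(x) ^ 2 / n) / (n - 1)
  if (sd) v <- sqrt(v)
  v
}

## Hypothetical stand-in for `dat`: rows are assumed to be in time order
toy <- data.frame(
  SID = c(304L, 302L, 304L, 304L, 302L),
  Tor = c(0.288282073, 0.42175565, 0.818102394, 0.196001777, 0.440625167)
)

## ave() computes FUN per SID and puts each result back on its own row
toy$runmean <- ave(toy$Tor, toy$SID, FUN = cummean)
toy$runvar <- ave(toy$Tor, toy$SID, FUN = cumvar)
toy
```

The output shown in this answer comes from the `sorted_dat` version above, not from this toy frame.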

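The result is shown below. Do not be surprised by the NaN values in the variance column: the first variance in each SID is always NaN. This is expected, since a variance can only be computed from two or more observations.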

## inspection
sorted_dat
#          timestamp SID        Tor Lowt Hit    runmean       runvar
#1  29-06-2021-06:00 301 0.70161919    0   1 0.70161919          NaN
#6  29-06-2021-06:05 302 0.42175565    0   0 0.42175565          NaN
#16 29-06-2021-06:15 302 0.44062517    0   0 0.43119041 0.0001780293
#7  29-06-2021-06:06 303 0.40973592    1   0 0.40973592          NaN
#17 29-06-2021-06:16 303 0.63975950    0   0 0.52474771 0.0264554237
#3  29-06-2021-06:02 304 0.28828207    1   0 0.28828207          NaN
#10 29-06-2021-06:09 304 0.81810239    0   1 0.55319223 0.1403547863
#13 29-06-2021-06:12 304 0.19600178    1   0 0.43412875 0.1127057339
#21 29-06-2021-06:20 304 0.51683020    0   0 0.45480411 0.0768470383
#25 29-06-2021-06:24 304 0.75267458    0   1 0.51437820 0.0753806422
#19 29-06-2021-06:18 307 0.77566027    0   1 0.77566027          NaN
#24 29-06-2021-06:23 307 0.97079290    0   1 0.87322659 0.0190383720
#11 29-06-2021-06:10 308 0.65939167    0   0 0.65939167          NaN
#22 29-06-2021-06:21 308 0.24467497    1   0 0.45203332 0.0859949690
#8  29-06-2021-06:07 309 0.94217020    0   1 0.94217020          NaN
#2  29-06-2021-06:01 351 0.63941629    0   0 0.63941629          NaN
#12 29-06-2021-06:11 351 0.24638798    1   0 0.44290214 0.0772356290
#18 29-06-2021-06:17 352 0.05000184    1   0 0.05000184          NaN
#23 29-06-2021-06:22 352 0.38620466    1   0 0.21810325 0.0565161698
#20 29-06-2021-06:19 353 0.76293419    0   1 0.76293419          NaN
#26 29-06-2021-06:25 354 0.19036674    1   0 0.19036674          NaN
#9  29-06-2021-06:08 356 0.38139652    1   0 0.38139652          NaN
#27 29-06-2021-06:26 356 0.56596405    0   0 0.47368029 0.0170325864
#4  29-06-2021-06:03 357 0.93236217    0   1 0.93236217          NaN
#14 29-06-2021-06:13 357 0.63263026    0   0 0.78249621 0.0449196080
#5  29-06-2021-06:04 358 0.36861663    1   0 0.36861663          NaN
#15 29-06-2021-06:14 358 0.66618385    0   0 0.51740024 0.0442731264
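To see where the NaN entries in the table come from, it may help to evaluate the variance formula by hand for a group with a single reading: the numerator is 0 and the denominator n - 1 is also 0, and R evaluates 0 / 0 as NaN. The short sketch below uses the only reading of SID 301; converting NaN to NA at the end is an optional step that is not part of the original answer.

```r
x <- 0.70161919   ## the single Tor reading of SID 301
n <- 1

## variance formula from cumvar() for one observation: 0 / 0
v <- (sum(x ^ 2) - sum(x) ^ 2 / n) / (n - 1)
is.nan(v)

## base R's var() reports NA rather than NaN in the same situation
var(x)

## optional: turn NaN into NA so that downstream code can use is.na() / na.rm
runvar <- c(NaN, 0.0001780293)   ## the two runvar values of SID 302
runvar[is.nan(runvar)] <- NA
runvar
```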
