我每隔半小时进行一些时间序列温度测量。我想计算一个平均累积增长度天数风格指标。我正在使用变量";日期时间";,但为了简单起见,省略了实际的约会时间。也不要担心这是否真的是增长学位天数的正确计算,事实并非如此。以下玩具数据模拟了这一挑战。
library(data.table)
#generate some approximate data.
dt<-data.table(datetime=seq(1,10.9, by=0.1),
date=rep(1:10, each=10),
T=floor(runif(100,1,10)))
现在我计算一个"每日"平均值:
dt[,T_mean_daily:=mean(T), by=date]
现在,我想做的是计算T_mean_daily
的累积和,并将其显示在一个新列中,但与T_mean_daily
一样,在某个日期的每个"日期时间"重复。我在用cumsum
进行可视化时遇到了一些问题。最终输出看起来像:
datetime date T T_mean_daily T_sum
1: 1.0 1 4 5.6 5.6
2: 1.1 1 6 5.6 5.6
3: 1.2 1 9 5.6 5.6
4: 1.3 1 7 5.6 5.6
5: 1.4 1 3 5.6 5.6
6: 1.5 1 8 5.6 5.6
7: 1.6 1 3 5.6 5.6
8: 1.7 1 7 5.6 5.6
9: 1.8 1 8 5.6 5.6
10: 1.9 1 1 5.6 5.6
11: 2.0 2 2 3.6 9.2
12: 2.1 2 5 3.6 9.2
13: 2.2 2 4 3.6 9.2
14: 2.3 2 1 3.6 9.2
15: 2.4 2 9 3.6 9.2
16: 2.5 2 5 3.6 9.2
17: 2.6 2 2 3.6 9.2
18: 2.7 2 5 3.6 9.2
19: 2.8 2 2 3.6 9.2
20: 2.9 2 1 3.6 9.2
21: 3.0 3 1 5.9 15.1
22: 3.1 3 4 5.9 15.1
正在寻找data.table
解决方案。这不是按组计算的cumsum
,我正在寻找每个第一行的总和或所有组中的唯一值。
这里有另一种data.table
方法。。。
setnafill(dt[!duplicated(date), T_sum := cumsum(T_mean_daily)], "locf", cols = "T_sum")
解释
由于我们只需要使用每个日期的第一行,我们可以在data.table的i
中使用!duplicated(date)
来选择这些行。在j
中,我们现在可以计算T_Mean_Daily
的累积和
现在,我们在所有第一个日期行上都留下了一个具有正确cumsum值的列,并且NA介于两者之间,因此使用setnafill
forlocf
-在T_sum
-列中的NA行上填充值。
基准
set.seed(42)
dt<-data.table(datetime=seq(1,10.9, by=0.1),
date=rep(1:10, each=10),
T=floor(runif(100,1,10)))
dt[,T_mean_daily:=mean(T), by=date]
microbenchmark::microbenchmark(
r2evans = {
test <- copy(dt)
test[ test[, .SD[1,], by = date][, T_mean_daily := cumsum(T_mean_daily)], T_sum := i.T_mean_daily, on = .(date)]
},
wimpel = {
test <- copy(dt)
setnafill(test[!duplicated(date), T_sum := cumsum(T_mean_daily)], "locf", cols = "T_sum")
}
)
Unit: microseconds
expr min lq mean median uq max neval cld
r2evans 3287.9 3488.20 3662.044 3560.65 3758.85 4833.1 100 b
wimpel 425.4 437.45 465.313 451.75 485.35 608.3 100 a
如果我们只对每个date
的第一行做一个临时子集,那么我们就可以使用cumsum
并将其连接回原始数据中。
set.seed(42)
dt<-data.table(datetime=seq(1,10.9, by=0.1),
date=rep(1:10, each=10),
T=floor(runif(100,1,10)))
dt[,T_mean_daily:=mean(T), by=date]
dt
# datetime date T T_mean_daily
# <num> <int> <num> <num>
# 1: 1.0 1 9 6.2
# 2: 1.1 1 9 6.2
# 3: 1.2 1 3 6.2
# 4: 1.3 1 8 6.2
# 5: 1.4 1 6 6.2
# 6: 1.5 1 5 6.2
# 7: 1.6 1 7 6.2
# 8: 1.7 1 2 6.2
# 9: 1.8 1 6 6.2
# 10: 1.9 1 7 6.2
# ---
# 91: 10.0 10 7 5.6
# 92: 10.1 10 1 5.6
# 93: 10.2 10 2 5.6
# 94: 10.3 10 9 5.6
# 95: 10.4 10 9 5.6
# 96: 10.5 10 7 5.6
# 97: 10.6 10 3 5.6
# 98: 10.7 10 5 5.6
# 99: 10.8 10 7 5.6
# 100: 10.9 10 6 5.6
聚合很简单:
dt[, .SD[1,], by = date][, T_mean_daily := cumsum(T_mean_daily)][]
# date datetime T T_mean_daily T_sum
# <int> <num> <num> <num> <num>
# 1: 1 1 9 6.2 6.2
# 2: 2 2 5 12.2 12.2
# 3: 3 3 9 18.3 18.3
# 4: 4 4 7 23.6 23.6
# 5: 5 5 4 29.6 29.6
# 6: 6 6 4 34.1 34.1
# 7: 7 7 7 40.1 40.1
# 8: 8 8 1 43.1 43.1
# 9: 9 9 6 47.0 47.0
# 10: 10 10 7 52.6 52.6
我们可以将原始数据合并为:
dt[ dt[, .SD[1,], by = date][, T_mean_daily := cumsum(T_mean_daily)], T_sum := i.T_mean_daily, on = .(date)]
dt
# datetime date T T_mean_daily T_sum
# <num> <int> <num> <num> <num>
# 1: 1.0 1 9 6.2 6.2
# 2: 1.1 1 9 6.2 6.2
# 3: 1.2 1 3 6.2 6.2
# 4: 1.3 1 8 6.2 6.2
# 5: 1.4 1 6 6.2 6.2
# 6: 1.5 1 5 6.2 6.2
# 7: 1.6 1 7 6.2 6.2
# 8: 1.7 1 2 6.2 6.2
# 9: 1.8 1 6 6.2 6.2
# 10: 1.9 1 7 6.2 6.2
# ---
# 91: 10.0 10 7 5.6 52.6
# 92: 10.1 10 1 5.6 52.6
# 93: 10.2 10 2 5.6 52.6
# 94: 10.3 10 9 5.6 52.6
# 95: 10.4 10 9 5.6 52.6
# 96: 10.5 10 7 5.6 52.6
# 97: 10.6 10 3 5.6 52.6
# 98: 10.7 10 5 5.6 52.6
# 99: 10.8 10 7 5.6 52.6
# 100: 10.9 10 6 5.6 52.6