r-data.table中组的第一个元素的累计和



我每隔半小时进行一些时间序列温度测量。我想计算一个平均累积增长度天数风格指标。我正在使用变量";日期时间";,但为了简单起见,省略了实际的约会时间。也不要担心这是否真的是增长学位天数的正确计算,事实并非如此。以下玩具数据模拟了这一挑战。

library(data.table)
#generate some approximate data.
dt<-data.table(datetime=seq(1,10.9, by=0.1),
date=rep(1:10, each=10),
T=floor(runif(100,1,10)))

现在我计算一个"每日"平均值:

dt[,T_mean_daily:=mean(T), by=date]

现在,我想做的是计算T_mean_daily的累积和,并将其显示在一个新列中,但与T_mean_daily一样,在某个日期的每个"日期时间"重复。我在用cumsum进行可视化时遇到了一些问题。最终输出看起来像:

datetime   date T T_mean_daily T_sum
1:      1.0    1 4          5.6   5.6
2:      1.1    1 6          5.6   5.6
3:      1.2    1 9          5.6   5.6
4:      1.3    1 7          5.6   5.6
5:      1.4    1 3          5.6   5.6
6:      1.5    1 8          5.6   5.6
7:      1.6    1 3          5.6   5.6
8:      1.7    1 7          5.6   5.6
9:      1.8    1 8          5.6   5.6
10:      1.9    1 1          5.6   5.6
11:      2.0    2 2          3.6   9.2
12:      2.1    2 5          3.6   9.2
13:      2.2    2 4          3.6   9.2
14:      2.3    2 1          3.6   9.2
15:      2.4    2 9          3.6   9.2
16:      2.5    2 5          3.6   9.2
17:      2.6    2 2          3.6   9.2
18:      2.7    2 5          3.6   9.2
19:      2.8    2 2          3.6   9.2
20:      2.9    2 1          3.6   9.2
21:      3.0    3 1          5.9   15.1
22:      3.1    3 4          5.9   15.1

正在寻找data.table解决方案。这不是按组计算的cumsum,我正在寻找每个第一行的总和或所有组中的唯一值。

这里有另一种data.table方法。。。

setnafill(dt[!duplicated(date), T_sum := cumsum(T_mean_daily)], "locf", cols = "T_sum")

解释

由于我们只需要使用每个日期的第一行,我们可以在data.table的i中使用!duplicated(date)来选择这些行。在j中,我们现在可以计算T_Mean_Daily的累积和
现在,我们在所有第一个日期行上都留下了一个具有正确cumsum值的列,并且NA介于两者之间,因此使用setnafillforlocf-在T_sum-列中的NA行上填充值。

基准

set.seed(42)
dt<-data.table(datetime=seq(1,10.9, by=0.1),
date=rep(1:10, each=10),
T=floor(runif(100,1,10)))
dt[,T_mean_daily:=mean(T), by=date]
microbenchmark::microbenchmark(
r2evans = {
test <- copy(dt)
test[ test[, .SD[1,], by = date][, T_mean_daily := cumsum(T_mean_daily)], T_sum := i.T_mean_daily, on = .(date)]
},
wimpel = {
test <- copy(dt)
setnafill(test[!duplicated(date), T_sum := cumsum(T_mean_daily)], "locf", cols = "T_sum")
}
)

Unit: microseconds
expr    min      lq     mean  median      uq    max neval cld
r2evans 3287.9 3488.20 3662.044 3560.65 3758.85 4833.1   100   b
wimpel  425.4  437.45  465.313  451.75  485.35  608.3   100  a 

如果我们只对每个date的第一行做一个临时子集,那么我们就可以使用cumsum并将其连接回原始数据中。

set.seed(42)
dt<-data.table(datetime=seq(1,10.9, by=0.1),
date=rep(1:10, each=10),
T=floor(runif(100,1,10)))
dt[,T_mean_daily:=mean(T), by=date]
dt
#      datetime  date     T T_mean_daily
#         <num> <int> <num>        <num>
#   1:      1.0     1     9          6.2
#   2:      1.1     1     9          6.2
#   3:      1.2     1     3          6.2
#   4:      1.3     1     8          6.2
#   5:      1.4     1     6          6.2
#   6:      1.5     1     5          6.2
#   7:      1.6     1     7          6.2
#   8:      1.7     1     2          6.2
#   9:      1.8     1     6          6.2
#  10:      1.9     1     7          6.2
#  ---                                  
#  91:     10.0    10     7          5.6
#  92:     10.1    10     1          5.6
#  93:     10.2    10     2          5.6
#  94:     10.3    10     9          5.6
#  95:     10.4    10     9          5.6
#  96:     10.5    10     7          5.6
#  97:     10.6    10     3          5.6
#  98:     10.7    10     5          5.6
#  99:     10.8    10     7          5.6
# 100:     10.9    10     6          5.6

聚合很简单:

dt[, .SD[1,], by = date][, T_mean_daily := cumsum(T_mean_daily)][]
#      date datetime     T T_mean_daily T_sum
#     <int>    <num> <num>        <num> <num>
#  1:     1        1     9          6.2   6.2
#  2:     2        2     5         12.2  12.2
#  3:     3        3     9         18.3  18.3
#  4:     4        4     7         23.6  23.6
#  5:     5        5     4         29.6  29.6
#  6:     6        6     4         34.1  34.1
#  7:     7        7     7         40.1  40.1
#  8:     8        8     1         43.1  43.1
#  9:     9        9     6         47.0  47.0
# 10:    10       10     7         52.6  52.6

我们可以将原始数据合并为:

dt[ dt[, .SD[1,], by = date][, T_mean_daily := cumsum(T_mean_daily)], T_sum := i.T_mean_daily, on = .(date)]
dt
#      datetime  date     T T_mean_daily T_sum
#         <num> <int> <num>        <num> <num>
#   1:      1.0     1     9          6.2   6.2
#   2:      1.1     1     9          6.2   6.2
#   3:      1.2     1     3          6.2   6.2
#   4:      1.3     1     8          6.2   6.2
#   5:      1.4     1     6          6.2   6.2
#   6:      1.5     1     5          6.2   6.2
#   7:      1.6     1     7          6.2   6.2
#   8:      1.7     1     2          6.2   6.2
#   9:      1.8     1     6          6.2   6.2
#  10:      1.9     1     7          6.2   6.2
#  ---                                        
#  91:     10.0    10     7          5.6  52.6
#  92:     10.1    10     1          5.6  52.6
#  93:     10.2    10     2          5.6  52.6
#  94:     10.3    10     9          5.6  52.6
#  95:     10.4    10     9          5.6  52.6
#  96:     10.5    10     7          5.6  52.6
#  97:     10.6    10     3          5.6  52.6
#  98:     10.7    10     5          5.6  52.6
#  99:     10.8    10     7          5.6  52.6
# 100:     10.9    10     6          5.6  52.6

最新更新