R 中的组内操作(不是滚动总和)



我有一个数据集,由学生(id(和他们每年的年级组成:

library(data.table)
set.seed(1)
students <- data.table("id" = rep(1:10, each = 10),
"year" = rep(2000:2009, 10),
"grade" = sample(c(9:11, rep(NA, 5)), 100, replace = T))

以下是学生 1 的示例:

id year grade
1:  1 2000     9
2:  1 2001    NA
3:  1 2002    NA
4:  1 2003     9
5:  1 2004    10
6:  1 2005    NA
7:  1 2006    NA
8:  1 2007    11
9:  1 2008    NA

我希望有一种方法可以访问每个学生以前和将来的成绩,以执行不同的操作。例如,将学生的最后三个年级相加。这将产生如下所示的数据集:

id year grade sum_lag_3
1:  1 2000     9         9 # 1st window, size 1: 9
2:  1 2001    NA         9 
3:  1 2002    NA         9
4:  1 2003     9        18 # 2nd, size 2: 9 + 9 = 18 
5:  1 2004    10        28 # 3rd, size 3: 9 + 9 + 10 = 28
6:  1 2005    NA        28
7:  1 2006    NA        28
8:  1 2007    11        30 # 4th, size 3: 9 + 10 + 11 = 30 
9:  1 2008    NA        30
10:  1 2009    10        31 # 5th, size 3: 10 + 11 + 10 = 31
11:  2 2001    11        11 # 1st window, size 1: 11 

(所有结果如下所示(。

  • 然而,这不是一篇关于形成滚动总和的帖子。
  • 我希望能够在每个小组内更普遍地进行操作,为此,我需要找到一种方法来参考所有学生过去和未来的成绩。

因此,在第一行的情况下,由于没有先前的观察结果,这意味着"过去"向量为空,但"未来"向量为NA NA 9 10 NA NA 11 NA 10

类似地,对于第二行,"过去"向量将是9,"未来"向量将是:

NA 9 10 NA NA 11 NA 10

对于第三行,"过去"向量将9 NA,"未来"向量为:

9 10 NA NA 11 NA 10

这是我想要参考的信息,以进行不同的计算。仅在每个组中进行的计算,并且因上下文而异。最好我想使用data.table来做到这一点,而不是将我的数据重塑为宽格式。

我尝试执行以下操作:

students[, .SD[, sum_last_3:= ...], by = id]

但是我收到一条错误消息,指出此功能尚未在data.table上可用(其中...是任何操作的占位符。

谢谢大家。


与@chinsoon12类似,但使用zoo::rollapply可以轻松地将sum应用于部分窗口。

d[!is.na(grade), rs := rollapply(grade, 3, sum, align = "right", partial = TRUE), by = id]
d[ , rs := nafill(rs, type = "locf"), by = id]
#     id year grade sum_lag_3 rs
#  1:  1 2000     9         9  9
#  2:  1 2001    NA         9  9
#  3:  1 2002    NA         9  9
#  4:  1 2003     9        18 18
#  5:  1 2004    10        28 28
#  6:  1 2005    NA        28 28
#  7:  1 2006    NA        28 28
#  8:  1 2007    11        30 30
#  9:  1 2008    NA        30 30
# 10:  1 2009    10        31 31
# 11:  2 2001    11        11 11

data.table::frollsum中,不支持partial窗口功能,尽管可以通过使用adaptive=TRUE"和自适应滚动功能来实现(见?frollsum(:

arf = function(n, len) if(len < n) seq.int(len) else c(seq.int(n), rep(n, len - n))
# if no 'grade' is shorter than n (the full window width), you only need: 
# c(seq.int(n), rep(n, len - n))
d[!is.na(grade) , rs2 := frollsum(grade, n = arf(3, .N), align = "right", adaptive = TRUE),
by = id]
d[ , rs2 := nafill(rs, type = "locf"), by = id]
#     id year grade sum_lag_3 rs rs2
#  1:  1 2000     9         9  9   9
#  2:  1 2001    NA         9  9   9
#  3:  1 2002    NA         9  9   9
#  4:  1 2003     9        18 18  18
#  5:  1 2004    10        28 28  28
#  6:  1 2005    NA        28 28  28
#  7:  1 2006    NA        28 28  28
#  8:  1 2007    11        30 30  30
#  9:  1 2008    NA        30 30  30
# 10:  1 2009    10        31 31  31
# 11:  2 2001    11        11 11  11

关于您的评论的说明:

我希望能够利用学生的过去和未来进行各种操作,而不仅仅是一笔钱

zoo::rollapply中,您可以将其他函数放在FUN参数中。目前data.table等价物frollapply没有adaptive参数。因此,我用于上述frollsum的方法还不能应用于frollapply

这是一个在data.table中使用frollsum的选项,首先将其应用于非 NA 值,然后再进行最后一次观察:

students[!is.na(grade), sum_lag_3 := 
fcoalesce(frollsum(grade, 3L), as.double(cumsum(grade))), id]
students[, sum_lag_3 := nafill(sum_lag_3, "locf"), id]

输出:

id year grade sum_lag_3
1:  1 2000     9         9
2:  1 2001    NA         9
3:  1 2002    NA         9
4:  1 2003     9        18
5:  1 2004    10        28
6:  1 2005    NA        28
7:  1 2006    NA        28
8:  1 2007    11        30
9:  1 2008    NA        30
10:  1 2009    10        31
11:  2 2000    11        11    <-----
12:  2 2001    11        22
13:  2 2002     9        31
14:  2 2003    NA        31
15:  2 2004    NA        31
16:  2 2005    10        30
17:  2 2006    NA        30
18:  2 2007    NA        30
19:  2 2008    10        29
20:  2 2009    NA        29
21:  3 2000     9         9
22:  3 2001    NA         9
23:  3 2002    NA         9
24:  3 2003    NA         9
25:  3 2004     9        18
26:  3 2005     9        27
27:  3 2006    NA        27
28:  3 2007    NA        27
29:  3 2008    NA        27
30:  3 2009    10        28
31:  4 2000    10        10
32:  4 2001    NA        10
33:  4 2002     9        19
34:  4 2003    NA        19
35:  4 2004    NA        19
36:  4 2005     9        28
37:  4 2006    NA        28
38:  4 2007    11        29
39:  4 2008    NA        29
40:  4 2009    10        30
41:  5 2000    10        10
42:  5 2001    NA        10
43:  5 2002    NA        10
44:  5 2003    NA        10
45:  5 2004    NA        10
46:  5 2005    NA        10
47:  5 2006    10        20
48:  5 2007    NA        20
49:  5 2008     9        29
50:  5 2009    NA        29
51:  6 2000    NA        NA
52:  6 2001     9         9
53:  6 2002    NA         9
54:  6 2003    NA         9
55:  6 2004     9        18
56:  6 2005    NA        18
57:  6 2006    NA        18
58:  6 2007    NA        18
59:  6 2008    10        28
60:  6 2009    NA        28
61:  7 2000    11        11
62:  7 2001    10        21
63:  7 2002    NA        21
64:  7 2003    NA        21
65:  7 2004    NA        21
66:  7 2005    NA        21
67:  7 2006    10        31
68:  7 2007    NA        31
69:  7 2008    10        30
70:  7 2009    NA        30
71:  8 2000    NA        NA
72:  8 2001    NA        NA
73:  8 2002     9         9
74:  8 2003    11        20
75:  8 2004    11        31
76:  8 2005    NA        31
77:  8 2006    NA        31
78:  8 2007    NA        31
79:  8 2008    NA        31
80:  8 2009    NA        31
81:  9 2000    NA        NA
82:  9 2001    NA        NA
83:  9 2002    NA        NA
84:  9 2003    11        11
85:  9 2004     9        20
86:  9 2005    NA        20
87:  9 2006    NA        20
88:  9 2007    NA        20
89:  9 2008     9        29
90:  9 2009    NA        29
91: 10 2000     9         9
92: 10 2001    NA         9
93: 10 2002    NA         9
94: 10 2003    NA         9
95: 10 2004    NA         9
96: 10 2005    NA         9
97: 10 2006    NA         9
98: 10 2007    NA         9
99: 10 2008    NA         9
100: 10 2009    NA         9
id year grade sum_lag_3

要解决OP的编辑问题:您可以遍历每个学生的每一行,以获取过去的向量和未来的向量:

#for example using sum on past grades and mean on future grades
pastFunc <- sum
futureFunc <- mean
students[, {
vapply(1L:.N, function(n) {
past <- grade[seq_len(n-1)]
future <- grade[seq_len(.N-n)+n]
sum(past, na.rm=TRUE) + mean(future, na.rm=TRUE)
}, numeric(1L))  
}, id]

相关内容

  • 没有找到相关文章

最新更新