r-为每个唯一ID创建滞后变量



我有一个不同ID、日期和数量的数据集。我创建了一个差异变量,然后想为每个唯一的ID创建滞后变量。

df <- data.frame(ID = c("1", "1", "1", "1", "1", "1",
"2", "2", "2", "2", "2", "2",
"3", "3", "3", "3", "3", "3"),
Date = c(seq(as.Date("2000/1/1"), by = "month", length.out = 6),
seq(as.Date("2000/1/1"), by = "month", length.out = 6),
seq(as.Date("2000/1/1"), by = "month", length.out = 6)),
QTY = c(143, 23, 432, 43, 432, 4645, 3762, 432, 432, 4645, 3762, 432, 122, 42, 43,655, 4645, 3762))
df <- transform(df, diff = ave(QTY, ID, FUN = function(x) c(NA, diff(x))))

我设法做到了,但滞后变量是使用其他ID以及计算的

for(i in unique(df$ID)) {
setDT(df)[, c("lag1","lag2", "lag3",
"lag4",
"lag5") := .(shift(diff, 1L, fill = NA, type = "lag"),
shift(diff, 2L, fill = NA, type = "lag"),
shift(diff, 3L, fill = NA, type = "lag"),
shift(diff, 4L, fill = NA, type = "lag"),
shift(diff, 5L, fill = NA, type = "lag")
)]
}

我的结果是这样的


ID  Date        QTY    diff lag1 lag2 lag3 lag4 lag5
1   1   2000-01-01  143    NA   NA  NA  NA  NA  NA
2   1   2000-02-01  23    -120  NA  NA  NA  NA  NA
3   1   2000-03-01  432   409   -120    NA  NA  NA  NA
4   1   2000-04-01  43    -389  409 -120    NA  NA  NA
5   1   2000-05-01  432   389   -389    409 -120    NA  NA
6   1   2000-06-01  4645  4213  389 -389    409 -120    NA
7   2   2000-01-01  3762  NA    4213    389 -389    409 -120
8   2   2000-02-01  432   -3330 NA  4213    389 -389    409
9   2   2000-03-01  432   0 -3330   NA  4213    389 -389
10  2   2000-04-01  4645  4213  0   -3330   NA  4213    389
11  2   2000-05-01  3762  -883  4213    0   -3330   NA  4213
12  2   2000-06-01  432   -3330 -883    4213    0   -3330   NA
13  3   2000-01-01  122   NA    -3330   -883    4213    0   -3330
14  3   2000-02-01  42   -80    NA  -3330   -883    4213    0
15  3   2000-03-01  43    1 -80 NA  -3330   -883    4213
16  3   2000-04-01  655   612   1   -80 NA  -3330   -883
17  3   2000-05-01  4645  3990  612 1   -80 NA  -3330
18  3   2000-06-01  3762  -883  3990    612 1   -80 NA

所以基本上第一个ID是正确计算的,而不是其他ID,因为它考虑了以前的ID。我该怎么解决这个问题?

使用data.table:

library(data.table)
n <- 1:5
setDT(df)[, paste0("lag", n) := shift(diff, n), ID]
df
#    ID       Date  QTY  diff  lag1  lag2  lag3  lag4 lag5
# 1:  1 2000-01-01  143    NA    NA    NA    NA    NA   NA
# 2:  1 2000-02-01   23  -120    NA    NA    NA    NA   NA
# 3:  1 2000-03-01  432   409  -120    NA    NA    NA   NA
# 4:  1 2000-04-01   43  -389   409  -120    NA    NA   NA
# 5:  1 2000-05-01  432   389  -389   409  -120    NA   NA
# 6:  1 2000-06-01 4645  4213   389  -389   409  -120   NA
# 7:  2 2000-01-01 3762    NA    NA    NA    NA    NA   NA
# 8:  2 2000-02-01  432 -3330    NA    NA    NA    NA   NA
# 9:  2 2000-03-01  432     0 -3330    NA    NA    NA   NA
#10:  2 2000-04-01 4645  4213     0 -3330    NA    NA   NA
#11:  2 2000-05-01 3762  -883  4213     0 -3330    NA   NA
#12:  2 2000-06-01  432 -3330  -883  4213     0 -3330   NA
#13:  3 2000-01-01  122    NA    NA    NA    NA    NA   NA
#14:  3 2000-02-01   42   -80    NA    NA    NA    NA   NA
#15:  3 2000-03-01   43     1   -80    NA    NA    NA   NA
#16:  3 2000-04-01  655   612     1   -80    NA    NA   NA
#17:  3 2000-05-01 4645  3990   612     1   -80    NA   NA
#18:  3 2000-06-01 3762  -883  3990   612     1   -80   NA

涉及purrrdplyr的一个选项可能是:

bind_cols(df, map(.x = 1:5, ~ df %>%
group_by(ID) %>%
mutate(diff = c(NA, diff(QTY)),
!!paste0("lag", .x) := lag(diff, n = .x)) %>%
ungroup() %>%
select(starts_with("lag"))))
ID       Date  QTY  lag1  lag2  lag3  lag4 lag5
1   1 2000-01-01  143    NA    NA    NA    NA   NA
2   1 2000-02-01   23    NA    NA    NA    NA   NA
3   1 2000-03-01  432  -120    NA    NA    NA   NA
4   1 2000-04-01   43   409  -120    NA    NA   NA
5   1 2000-05-01  432  -389   409  -120    NA   NA
6   1 2000-06-01 4645   389  -389   409  -120   NA
7   2 2000-01-01 3762    NA    NA    NA    NA   NA
8   2 2000-02-01  432    NA    NA    NA    NA   NA
9   2 2000-03-01  432 -3330    NA    NA    NA   NA
10  2 2000-04-01 4645     0 -3330    NA    NA   NA
11  2 2000-05-01 3762  4213     0 -3330    NA   NA
12  2 2000-06-01  432  -883  4213     0 -3330   NA
13  3 2000-01-01  122    NA    NA    NA    NA   NA
14  3 2000-02-01   42    NA    NA    NA    NA   NA
15  3 2000-03-01   43   -80    NA    NA    NA   NA
16  3 2000-04-01  655     1   -80    NA    NA   NA
17  3 2000-05-01 4645   612     1   -80    NA   NA
18  3 2000-06-01 3762  3990   612     1   -80   NA

也可以只使用dplyr

library(dplyr)
df <- df %>%
group_by(ID) %>%
mutate(diff = QTY-lag(QTY),
lag1 = lag(diff),
lag2 = lag(lag1),
lag3 = lag(lag2),
lag4 = lag(lag3),
lag5 = lag(lag4))

最新更新