我有一个不同ID、日期和数量的数据集。我创建了一个差异变量,然后想为每个唯一的ID创建滞后变量。
df <- data.frame(ID = c("1", "1", "1", "1", "1", "1",
"2", "2", "2", "2", "2", "2",
"3", "3", "3", "3", "3", "3"),
Date = c(seq(as.Date("2000/1/1"), by = "month", length.out = 6),
seq(as.Date("2000/1/1"), by = "month", length.out = 6),
seq(as.Date("2000/1/1"), by = "month", length.out = 6)),
QTY = c(143, 23, 432, 43, 432, 4645, 3762, 432, 432, 4645, 3762, 432, 122, 42, 43,655, 4645, 3762))
df <- transform(df, diff = ave(QTY, ID, FUN = function(x) c(NA, diff(x))))
我设法做到了,但滞后变量是使用其他ID以及计算的
for(i in unique(df$ID)) {
setDT(df)[, c("lag1","lag2", "lag3",
"lag4",
"lag5") := .(shift(diff, 1L, fill = NA, type = "lag"),
shift(diff, 2L, fill = NA, type = "lag"),
shift(diff, 3L, fill = NA, type = "lag"),
shift(diff, 4L, fill = NA, type = "lag"),
shift(diff, 5L, fill = NA, type = "lag")
)]
}
我的结果是这样的
ID Date QTY diff lag1 lag2 lag3 lag4 lag5
1 1 2000-01-01 143 NA NA NA NA NA NA
2 1 2000-02-01 23 -120 NA NA NA NA NA
3 1 2000-03-01 432 409 -120 NA NA NA NA
4 1 2000-04-01 43 -389 409 -120 NA NA NA
5 1 2000-05-01 432 389 -389 409 -120 NA NA
6 1 2000-06-01 4645 4213 389 -389 409 -120 NA
7 2 2000-01-01 3762 NA 4213 389 -389 409 -120
8 2 2000-02-01 432 -3330 NA 4213 389 -389 409
9 2 2000-03-01 432 0 -3330 NA 4213 389 -389
10 2 2000-04-01 4645 4213 0 -3330 NA 4213 389
11 2 2000-05-01 3762 -883 4213 0 -3330 NA 4213
12 2 2000-06-01 432 -3330 -883 4213 0 -3330 NA
13 3 2000-01-01 122 NA -3330 -883 4213 0 -3330
14 3 2000-02-01 42 -80 NA -3330 -883 4213 0
15 3 2000-03-01 43 1 -80 NA -3330 -883 4213
16 3 2000-04-01 655 612 1 -80 NA -3330 -883
17 3 2000-05-01 4645 3990 612 1 -80 NA -3330
18 3 2000-06-01 3762 -883 3990 612 1 -80 NA
所以基本上第一个ID是正确计算的,而不是其他ID,因为它考虑了以前的ID。我该怎么解决这个问题?
使用data.table
:
library(data.table)
n <- 1:5
setDT(df)[, paste0("lag", n) := shift(diff, n), ID]
df
# ID Date QTY diff lag1 lag2 lag3 lag4 lag5
# 1: 1 2000-01-01 143 NA NA NA NA NA NA
# 2: 1 2000-02-01 23 -120 NA NA NA NA NA
# 3: 1 2000-03-01 432 409 -120 NA NA NA NA
# 4: 1 2000-04-01 43 -389 409 -120 NA NA NA
# 5: 1 2000-05-01 432 389 -389 409 -120 NA NA
# 6: 1 2000-06-01 4645 4213 389 -389 409 -120 NA
# 7: 2 2000-01-01 3762 NA NA NA NA NA NA
# 8: 2 2000-02-01 432 -3330 NA NA NA NA NA
# 9: 2 2000-03-01 432 0 -3330 NA NA NA NA
#10: 2 2000-04-01 4645 4213 0 -3330 NA NA NA
#11: 2 2000-05-01 3762 -883 4213 0 -3330 NA NA
#12: 2 2000-06-01 432 -3330 -883 4213 0 -3330 NA
#13: 3 2000-01-01 122 NA NA NA NA NA NA
#14: 3 2000-02-01 42 -80 NA NA NA NA NA
#15: 3 2000-03-01 43 1 -80 NA NA NA NA
#16: 3 2000-04-01 655 612 1 -80 NA NA NA
#17: 3 2000-05-01 4645 3990 612 1 -80 NA NA
#18: 3 2000-06-01 3762 -883 3990 612 1 -80 NA
涉及purrr
和dplyr
的一个选项可能是:
bind_cols(df, map(.x = 1:5, ~ df %>%
group_by(ID) %>%
mutate(diff = c(NA, diff(QTY)),
!!paste0("lag", .x) := lag(diff, n = .x)) %>%
ungroup() %>%
select(starts_with("lag"))))
ID Date QTY lag1 lag2 lag3 lag4 lag5
1 1 2000-01-01 143 NA NA NA NA NA
2 1 2000-02-01 23 NA NA NA NA NA
3 1 2000-03-01 432 -120 NA NA NA NA
4 1 2000-04-01 43 409 -120 NA NA NA
5 1 2000-05-01 432 -389 409 -120 NA NA
6 1 2000-06-01 4645 389 -389 409 -120 NA
7 2 2000-01-01 3762 NA NA NA NA NA
8 2 2000-02-01 432 NA NA NA NA NA
9 2 2000-03-01 432 -3330 NA NA NA NA
10 2 2000-04-01 4645 0 -3330 NA NA NA
11 2 2000-05-01 3762 4213 0 -3330 NA NA
12 2 2000-06-01 432 -883 4213 0 -3330 NA
13 3 2000-01-01 122 NA NA NA NA NA
14 3 2000-02-01 42 NA NA NA NA NA
15 3 2000-03-01 43 -80 NA NA NA NA
16 3 2000-04-01 655 1 -80 NA NA NA
17 3 2000-05-01 4645 612 1 -80 NA NA
18 3 2000-06-01 3762 3990 612 1 -80 NA
也可以只使用dplyr
library(dplyr)
df <- df %>%
group_by(ID) %>%
mutate(diff = QTY-lag(QTY),
lag1 = lag(diff),
lag2 = lag(lag1),
lag3 = lag(lag2),
lag4 = lag(lag3),
lag5 = lag(lag4))