我正在寻找一种方法,包括一个数据框架的所有变量和他们的组内第一滞后在回归模型。我的数据有一个类似的形状:
df <- data.frame(group = c('A','A','A','A','B','B','B','B'),
y= 11:18,
x1= 21:28,
x2= 1:8)
df
group y x1 x2
1 A 11 21 1
2 A 12 22 2
3 A 13 23 3
4 A 14 24 4
5 B 15 25 5
6 B 16 26 6
7 B 17 27 7
8 B 18 28 8
我想包括x1和x2作为回归量以及它们各自的第一滞后。重要的是,x1和x2的滞后保持在它们的A组和B组内(即我对第5行和第4行之间的差异不感兴趣)。
然而,我的真实数据集包含超过150个变量,因此手动将滞后值作为新变量添加到数据帧的方法(例如df <- df %>% group_by(group) %>% mutate(value1_lag = lag(value1))
)是不可行的。
相反,我正在寻找沿着
一行的东西model <- dyn$lmf(y ~ . + lag(.), data = df)
但是它似乎不能与简单的lag()操作符一起工作。
提前感谢您的帮助!
您可以使用across
:
df <- data.frame(group = c('A','A','A','A','B','B','B','B'),
y= 11:18,
x1= 21:28,
x2= 1:8)
library(dplyr)
library(tidyr)
df %>%
group_by(group) %>%
mutate(time = row_number(),
lag = across(-c(y, time), lag)) %>%
unnest(lag, names_sep = "_") %>%
arrange(group, time) %>%
group_by(group) %>%
mutate(across(starts_with("lag_"), ~case_when(time == 1 ~ lead(.x),
TRUE ~ .x)))
#> # A tibble: 8 × 7
#> # Groups: group [2]
#> group y x1 x2 time lag_x1 lag_x2
#> <chr> <int> <int> <int> <int> <int> <int>
#> 1 A 11 21 1 1 21 1
#> 2 A 12 22 2 2 21 1
#> 3 A 13 23 3 3 22 2
#> 4 A 14 24 4 4 23 3
#> 5 B 15 25 5 1 25 5
#> 6 B 16 26 6 2 25 5
#> 7 B 17 27 7 3 26 6
#> 8 B 18 28 8 4 27 7
在2022-05-09由reprex包(v2.0.1)创建
您可以将回归量的名称grep
到向量xn
中,将数据by
组分开,将滞后函数应用于setNames
,rbind
。
xn <- grep('^X', names(dat), value=TRUE)
dat <- by(dat, dat$group, (X) {
cbind(X, setNames(as.data.frame(lapply(X[xn], (x) c(NA, x[-length(x)]))), paste0(xn, '.lag1')))
}) |> do.call(what='rbind')
然后是reformulate
,而grep
再次与collapse=
和xn
。
lm(reformulate(grep(paste(xn, collapse='|'), names(dat), value=TRUE), 'Y'), dat) |>
summary()
# Call:
# lm(formula = reformulate(grep(paste(xn, collapse = "|"), names(dat),
# value = TRUE), "Y"), data = dat)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.85222 -0.31818 0.05702 0.33261 1.01141
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.6775 0.1768 3.832 0.000231 ***
# X1 1.5754 0.1688 9.335 5.22e-15 ***
# X2 2.5787 0.1750 14.737 < 2e-16 ***
# X1.lag1 0.7651 0.1652 4.630 1.18e-05 ***
# X2.lag1 1.9722 0.1777 11.098 < 2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.4892 on 93 degrees of freedom
# (2 observations deleted due to missingness)
# Multiple R-squared: 0.8234, Adjusted R-squared: 0.8158
# F-statistic: 108.4 on 4 and 93 DF, p-value: < 2.2e-16
数据:
dat <- structure(list(X1 = c(0.915, 0.937, 0.286, 0.83, 0.642, 0.519,
0.737, 0.135, 0.657, 0.705, 0.458, 0.719, 0.935, 0.255, 0.462,
0.94, 0.978, 0.117, 0.475, 0.56, 0.904, 0.139, 0.989, 0.947,
0.082, 0.514, 0.39, 0.906, 0.447, 0.836, 0.738, 0.811, 0.388,
0.685, 0.004, 0.833, 0.007, 0.208, 0.907, 0.612, 0.38, 0.436,
0.037, 0.974, 0.432, 0.958, 0.888, 0.64, 0.971, 0.619, 0.852,
0.443, 0.158, 0.442, 0.968, 0.485, 0.252, 0.26, 0.542, 0.65,
0.336, 0.061, 0.451, 0.839, 0.575, 0.353, 0.547, 0.893, 0.49,
0.172, 0.543, 0.961, 0.314, 0.821, 0.307, 0.185, 0.048, 0.246,
0.351, 0.159, 0.304, 0.018, 0.997, 0.804, 0.087, 0.87, 0.555,
0.421, 0.068, 0.561, 0.071, 0.211, 0.55, 0.482, 0.159, 0.15,
0.499, 0.941, 0.334, 0.188), X2 = c(0.333, 0.347, 0.398, 0.785,
0.039, 0.749, 0.677, 0.171, 0.261, 0.514, 0.676, 0.983, 0.76,
0.566, 0.85, 0.189, 0.271, 0.828, 0.693, 0.241, 0.043, 0.14,
0.216, 0.479, 0.197, 0.719, 0.008, 0.375, 0.514, 0.002, 0.582,
0.158, 0.359, 0.646, 0.776, 0.564, 0.234, 0.09, 0.086, 0.305,
0.667, 0, 0.209, 0.933, 0.926, 0.734, 0.333, 0.515, 0.744, 0.619,
0.27, 0.531, 0.021, 0.799, 0.11, 0.54, 0.571, 0.619, 0.715, 0.123,
0.311, 0.946, 0.5, 0.135, 0.869, 0.205, 0.925, 0.887, 0.136,
0.785, 0.453, 0.136, 0.885, 0.337, 0.319, 0.404, 0.479, 0.368,
0.466, 0.05, 0.187, 0.983, 0.328, 0.171, 0.488, 0.019, 0.339,
0.03, 0.867, 0.732, 0.315, 0.386, 0.332, 0.09, 0.757, 0.603,
0.145, 0.033, 0.484, 0.445), Y = c(3.499, 5.284, 3.72, 5.143,
3.938, 4.344, 5.567, 1.378, 2.717, 3.949, 4.769, 6.487, 7.533,
4.3, 5.513, 4.778, 4.609, 4.701, 5.164, 3.182, 2.933, 2.001,
2.57, 4.482, 2.953, 4.048, 2.989, 2.901, 3.544, 3.909, 4.188,
4.041, 2.92, 3.816, 4.618, 4.804, 2.585, 1.627, 2.869, 3.913,
3.746, 2.912, 1.832, 4.645, 6.05, 5.835, 4.471, 4.419, 6.171,
6.155, 3.79, 3.859, 2.571, 2.721, 4.324, 3.452, 3.792, 4.636,
4.482, 3.432, 3.037, 3.83, 4.799, 4.377, 4.026, 4.404, 4.303,
6.612, 4.414, 4.023, 3.419, 4.627, 4.849, 4.603, 2.815, 3.36,
3.341, 3.085, 3.44, 2.349, 2.222, 3.909, 4.596, 3.231, 3.66,
3.347, 2.823, 2.017, 3.485, 4.754, 3.952, 2.139, 2.819, 3.519,
3.637, 4.422, 4.153, 2.96, 2.407, 3.646), group = c("A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B",
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B",
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B",
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B",
"B", "B", "B", "B", "B", "B", "B")), row.names = c(NA, -100L), class = "data.frame")
这个问题有几个问题:
- dyn不是纵向数据的正确包。
- 问题中显示的df不是一个很好的例子,因为由于使用线性序列,列是线性相关的,所以尝试这个例子,我们将使用mtcars中的mpg作为因变量,cyl作为分组变量,hp和disp作为自变量。
请注意,在不同的包中有多个延迟版本,在基础R中也是如此,因此我们每次都明确地引用我们想要的版本。
1) plmplm包具有专门用于面板数据的功能。首先添加一个时间列,然后使用plm。plm model=参数可以提供许多其他可能性。PLM假设第一列是group,第二列是time。
library(plm)
mtcars2 <- transform(mtcars, time = ave(1:nrow(mtcars), cyl, FUN = seq_along))
indep <- names(mtcars2)[3:4]
L01 <- function(x) plm::lag(x, 0:1)
fo <- reformulate(sprintf("L01(%s)", indep), "mpg")
fo
## mpg ~ L01(disp) + L01(hp)
plm(fo, data = mtcars2, index = c("cyl", "time"), model = "pooling")
给:
Model Formula: mpg ~ L01(disp) + L01(hp)
Coefficients:
(Intercept) L01(disp)0 L01(disp)1 L01(hp)0 L01(hp)1
31.0212451 -0.0327843 -0.0019717 -0.0243698 0.0051201
2) lm使用dplyr计算滞后变量,使用reformula计算公式,然后调用lm。
library(dplyr)
indep <- names(mtcars2)[3:4]
mtcars2 <- mtcars %>%
group_by(cyl = factor(cyl)) %>%
mutate(across(all_of(indep), dplyr::lag, .names = "lag.{col}")) %>%
ungroup
fo <- reformulate(c(indep, paste0("lag.", indep)), "mpg")
fo
## mpg ~ disp + hp + lag.disp + lag.hp
lm(fo, mtcars2, na.action = na.exclude)
给出相同的系数:
Call:
lm(formula = fo, data = mtcars2, na.action = na.exclude)
Coefficients:
(Intercept) disp hp lag.disp lag.hp
31.021245 -0.032784 -0.024370 -0.001972 0.005120
更新重组的答案。