r语言 - 包括回归中所有变量的第一个滞后



我正在寻找一种方法,包括一个数据框架的所有变量和他们的组内第一滞后在回归模型。我的数据有一个类似的形状:

df <- data.frame(group = c('A','A','A','A','B','B','B','B'),
y= 11:18,
x1= 21:28,
x2= 1:8)
df
group  y      x1      x2
1     A     11     21      1
2     A     12     22      2
3     A     13     23      3
4     A     14     24      4
5     B     15     25      5
6     B     16     26      6
7     B     17     27      7
8     B     18     28      8

我想包括x1和x2作为回归量以及它们各自的第一滞后。重要的是,x1和x2的滞后保持在它们的A组和B组内(即我对第5行和第4行之间的差异不感兴趣)。

然而,我的真实数据集包含超过150个变量,因此手动将滞后值作为新变量添加到数据帧的方法(例如df <- df %>% group_by(group) %>% mutate(value1_lag = lag(value1)))是不可行的。

相反,我正在寻找沿着

一行的东西model <- dyn$lmf(y ~ . + lag(.), data = df)

但是它似乎不能与简单的lag()操作符一起工作。

提前感谢您的帮助!

您可以使用across:

一次性添加所有变量
df <- data.frame(group = c('A','A','A','A','B','B','B','B'),
y= 11:18,
x1= 21:28,
x2= 1:8)
library(dplyr)
library(tidyr)
df %>% 
group_by(group) %>% 
mutate(time = row_number(), 
lag = across(-c(y, time), lag)) %>% 
unnest(lag, names_sep = "_") %>%
arrange(group, time) %>% 
group_by(group) %>% 
mutate(across(starts_with("lag_"), ~case_when(time == 1 ~ lead(.x), 
TRUE ~ .x)))
#> # A tibble: 8 × 7
#> # Groups:   group [2]
#>   group     y    x1    x2  time lag_x1 lag_x2
#>   <chr> <int> <int> <int> <int>  <int>  <int>
#> 1 A        11    21     1     1     21      1
#> 2 A        12    22     2     2     21      1
#> 3 A        13    23     3     3     22      2
#> 4 A        14    24     4     4     23      3
#> 5 B        15    25     5     1     25      5
#> 6 B        16    26     6     2     25      5
#> 7 B        17    27     7     3     26      6
#> 8 B        18    28     8     4     27      7

在2022-05-09由reprex包(v2.0.1)创建

您可以将回归量的名称grep到向量xn中,将数据by组分开,将滞后函数应用于setNames,rbind

xn <- grep('^X', names(dat), value=TRUE)
dat <- by(dat, dat$group, (X) {
cbind(X, setNames(as.data.frame(lapply(X[xn], (x) c(NA, x[-length(x)]))), paste0(xn, '.lag1')))
}) |> do.call(what='rbind')

然后是reformulate,而grep再次与collapse=xn

lm(reformulate(grep(paste(xn, collapse='|'), names(dat), value=TRUE), 'Y'), dat) |>
summary() 
# Call:
# lm(formula = reformulate(grep(paste(xn, collapse = "|"), names(dat), 
#                                 value = TRUE), "Y"), data = dat)
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -1.85222 -0.31818  0.05702  0.33261  1.01141 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   0.6775     0.1768   3.832 0.000231 ***
# X1            1.5754     0.1688   9.335 5.22e-15 ***
# X2            2.5787     0.1750  14.737  < 2e-16 ***
# X1.lag1       0.7651     0.1652   4.630 1.18e-05 ***
# X2.lag1       1.9722     0.1777  11.098  < 2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.4892 on 93 degrees of freedom
# (2 observations deleted due to missingness)
# Multiple R-squared:  0.8234,  Adjusted R-squared:  0.8158 
# F-statistic: 108.4 on 4 and 93 DF,  p-value: < 2.2e-16

数据:

dat <- structure(list(X1 = c(0.915, 0.937, 0.286, 0.83, 0.642, 0.519, 
0.737, 0.135, 0.657, 0.705, 0.458, 0.719, 0.935, 0.255, 0.462, 
0.94, 0.978, 0.117, 0.475, 0.56, 0.904, 0.139, 0.989, 0.947, 
0.082, 0.514, 0.39, 0.906, 0.447, 0.836, 0.738, 0.811, 0.388, 
0.685, 0.004, 0.833, 0.007, 0.208, 0.907, 0.612, 0.38, 0.436, 
0.037, 0.974, 0.432, 0.958, 0.888, 0.64, 0.971, 0.619, 0.852, 
0.443, 0.158, 0.442, 0.968, 0.485, 0.252, 0.26, 0.542, 0.65, 
0.336, 0.061, 0.451, 0.839, 0.575, 0.353, 0.547, 0.893, 0.49, 
0.172, 0.543, 0.961, 0.314, 0.821, 0.307, 0.185, 0.048, 0.246, 
0.351, 0.159, 0.304, 0.018, 0.997, 0.804, 0.087, 0.87, 0.555, 
0.421, 0.068, 0.561, 0.071, 0.211, 0.55, 0.482, 0.159, 0.15, 
0.499, 0.941, 0.334, 0.188), X2 = c(0.333, 0.347, 0.398, 0.785, 
0.039, 0.749, 0.677, 0.171, 0.261, 0.514, 0.676, 0.983, 0.76, 
0.566, 0.85, 0.189, 0.271, 0.828, 0.693, 0.241, 0.043, 0.14, 
0.216, 0.479, 0.197, 0.719, 0.008, 0.375, 0.514, 0.002, 0.582, 
0.158, 0.359, 0.646, 0.776, 0.564, 0.234, 0.09, 0.086, 0.305, 
0.667, 0, 0.209, 0.933, 0.926, 0.734, 0.333, 0.515, 0.744, 0.619, 
0.27, 0.531, 0.021, 0.799, 0.11, 0.54, 0.571, 0.619, 0.715, 0.123, 
0.311, 0.946, 0.5, 0.135, 0.869, 0.205, 0.925, 0.887, 0.136, 
0.785, 0.453, 0.136, 0.885, 0.337, 0.319, 0.404, 0.479, 0.368, 
0.466, 0.05, 0.187, 0.983, 0.328, 0.171, 0.488, 0.019, 0.339, 
0.03, 0.867, 0.732, 0.315, 0.386, 0.332, 0.09, 0.757, 0.603, 
0.145, 0.033, 0.484, 0.445), Y = c(3.499, 5.284, 3.72, 5.143, 
3.938, 4.344, 5.567, 1.378, 2.717, 3.949, 4.769, 6.487, 7.533, 
4.3, 5.513, 4.778, 4.609, 4.701, 5.164, 3.182, 2.933, 2.001, 
2.57, 4.482, 2.953, 4.048, 2.989, 2.901, 3.544, 3.909, 4.188, 
4.041, 2.92, 3.816, 4.618, 4.804, 2.585, 1.627, 2.869, 3.913, 
3.746, 2.912, 1.832, 4.645, 6.05, 5.835, 4.471, 4.419, 6.171, 
6.155, 3.79, 3.859, 2.571, 2.721, 4.324, 3.452, 3.792, 4.636, 
4.482, 3.432, 3.037, 3.83, 4.799, 4.377, 4.026, 4.404, 4.303, 
6.612, 4.414, 4.023, 3.419, 4.627, 4.849, 4.603, 2.815, 3.36, 
3.341, 3.085, 3.44, 2.349, 2.222, 3.909, 4.596, 3.231, 3.66, 
3.347, 2.823, 2.017, 3.485, 4.754, 3.952, 2.139, 2.819, 3.519, 
3.637, 4.422, 4.153, 2.96, 2.407, 3.646), group = c("A", "A", 
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B")), row.names = c(NA, -100L), class = "data.frame")

这个问题有几个问题:

  1. dyn不是纵向数据的正确包。
  2. 问题中显示的df不是一个很好的例子,因为由于使用线性序列,列是线性相关的,所以尝试这个例子,我们将使用mtcars中的mpg作为因变量,cyl作为分组变量,hp和disp作为自变量。

请注意,在不同的包中有多个延迟版本,在基础R中也是如此,因此我们每次都明确地引用我们想要的版本。

1) plmplm包具有专门用于面板数据的功能。首先添加一个时间列,然后使用plm。plm model=参数可以提供许多其他可能性。PLM假设第一列是group,第二列是time。

library(plm)
mtcars2 <- transform(mtcars, time = ave(1:nrow(mtcars), cyl, FUN = seq_along))
indep <- names(mtcars2)[3:4]
L01 <- function(x) plm::lag(x, 0:1)
fo <- reformulate(sprintf("L01(%s)", indep), "mpg")
fo
## mpg ~ L01(disp) + L01(hp)
plm(fo, data = mtcars2, index = c("cyl", "time"), model = "pooling")

给:

Model Formula: mpg ~ L01(disp) + L01(hp)
Coefficients:
(Intercept)  L01(disp)0  L01(disp)1    L01(hp)0    L01(hp)1 
31.0212451  -0.0327843  -0.0019717  -0.0243698   0.0051201 

2) lm使用dplyr计算滞后变量,使用reformula计算公式,然后调用lm。

library(dplyr)
indep <- names(mtcars2)[3:4]
mtcars2 <- mtcars %>%
group_by(cyl = factor(cyl)) %>%
mutate(across(all_of(indep), dplyr::lag, .names = "lag.{col}")) %>%
ungroup
fo <- reformulate(c(indep, paste0("lag.", indep)), "mpg")
fo
## mpg ~ disp + hp + lag.disp + lag.hp
lm(fo, mtcars2, na.action = na.exclude)

给出相同的系数:

Call:
lm(formula = fo, data = mtcars2, na.action = na.exclude)
Coefficients:
(Intercept)         disp           hp     lag.disp       lag.hp  
31.021245    -0.032784    -0.024370    -0.001972     0.005120  

更新重组的答案。

相关内容

最新更新