r - How do I create a column that depends on the mean of previously observed events?



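In the data below, we observe fictitious GDP growth for a particular country over time. My goal is to create a variable with three categories: 0 = no crisis, 1 = crisis, 2 = severe crisis. An economic crisis is identified as a year in which the growth rate is at least one (crisis) or two (severe crisis) standard deviations below the mean growth trend of the previous three years.

Can anyone offer some guidance?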

growth  year
5   1990
4   1991
0   1992
-4  1993
-3  1994
-1  1995
2   1996
4   1997
7   1998
10  1999
8   2000
-10 2001
-8  2002
2   2003
4   2004
5   2005
8   2006
4   2007
-10 2008
-9  2009
-8  2010
-3  2011
0   2012
-5  2013
-6  2014
-2  2015
4   2016
5   2017
5   2018
8   2019
2   2020
-1  2021
-1  2022

The data:

df=structure(list(gdp_growth = c(5, 4, 0, -4, -3, -1, 2, 4, 7, 10, 
8, -10, -8, 2, 4, 5, 8, 4, -10, -9, -8, -3, 0, -5, -6, -2, 4, 
5, 5, 8, 2, -1, -1), year = c(1990, 1991, 1992, 1993, 1994, 1995, 
1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 
2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 
2018, 2019, 2020, 2021, 2022)), row.names = c(NA, -33L), class = "data.frame")
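To make the rule concrete, here is a minimal sketch of the check for a single year, 2001, worked by hand in base R. It assumes that "at least one standard deviation below" means `<=`, and that the standard deviation is the sample standard deviation of the same three preceding years; the variable names are only for illustration.

```r
# Growth in the three years before 2001 (1998, 1999, 2000), taken from the data above
prev <- c(7, 10, 8)
current <- -10                 # growth in 2001

m <- mean(prev)                # trailing mean of the previous three years
s <- sd(prev)                  # sample standard deviation of the same window

# 2 = severe crisis, 1 = crisis, 0 = no crisis
crisis <- if (current <= m - 2 * s) 2 else if (current <= m - s) 1 else 0
crisis
```

Here the trailing mean is about 8.33 and the standard deviation about 1.53, so a growth rate of -10 falls well below both thresholds and 2001 is classified as a severe crisis.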

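From your description, it sounds like you first need to compute a rolling mean of growth and then compare the current year's growth against it: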

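library(dplyr)

# mn is the 3-year rolling mean ending in (and including) the current year;
# sd(gdp_growth) is the standard deviation of the whole series.
df %>%
  mutate(mn = zoo::rollmean(gdp_growth, 3, fill = NA, align = 'right'),
         crisis = ifelse(gdp_growth < (mn - sd(gdp_growth)),
                         ifelse(gdp_growth < (mn - 2 * sd(gdp_growth)),
                                2, 1), 0)) %>%
  select(-mn)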
#>    gdp_growth year crisis
#> 1           5 1990     NA
#> 2           4 1991     NA
#> 3           0 1992      0
#> 4          -4 1993      0
#> 5          -3 1994      0
#> 6          -1 1995      0
#> 7           2 1996      0
#> 8           4 1997      0
#> 9           7 1998      0
#> 10         10 1999      0
#> 11          8 2000      0
#> 12        -10 2001      2
#> 13         -8 2002      0
#> 14          2 2003      0
#> 15          4 2004      0
#> 16          5 2005      0
#> 17          8 2006      0
#> 18          4 2007      0
#> 19        -10 2008      1
#> 20         -9 2009      0
#> 21         -8 2010      0
#> 22         -3 2011      0
#> 23          0 2012      0
#> 24         -5 2013      0
#> 25         -6 2014      0
#> 26         -2 2015      0
#> 27          4 2016      0
#> 28          5 2017      0
#> 29          5 2018      0
#> 30          8 2019      0
#> 31          2 2020      0
#> 32         -1 2021      0
#> 33         -1 2022      0
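The code above compares each year with a window that includes the year itself and uses the standard deviation of the whole series. If the window should instead cover only the three preceding years, with the standard deviation taken from that same window, one possible variation in base R looks like the sketch below. `classify_crisis` is a hypothetical helper written for this illustration, not a function from any package, and the use of `<=` for "at least" is an assumption.

```r
# Hypothetical helper: classify each value against the k values before it.
# Returns NA for the first k positions, where no full window exists.
classify_crisis <- function(x, k = 3) {
  out <- rep(NA_real_, length(x))
  if (length(x) <= k) return(out)
  for (i in (k + 1):length(x)) {
    prev <- x[(i - k):(i - 1)]   # the k preceding observations
    m <- mean(prev)
    s <- sd(prev)
    out[i] <- if (x[i] <= m - 2 * s) 2 else if (x[i] <= m - s) 1 else 0
  }
  out
}

# Growth for 1990-2001, copied from the data above
growth <- c(5, 4, 0, -4, -3, -1, 2, 4, 7, 10, 8, -10)
crisis <- classify_crisis(growth)
crisis
```

On the full data frame this could be used as `df$crisis <- classify_crisis(df$gdp_growth)`. A plain loop is slower than a dedicated rolling function, but it keeps the window definition explicit.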

Here is another example, this time using the RcppRoll package, which provides a large set of fast rolling functions that work well with dplyr.

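library(dplyr)

df %>%
  mutate(
    # rolling SD over the current year and the two years before it
    std3 = RcppRoll::roll_sd(gdp_growth, 3, fill = 0, align = "right"),
    # labels are based on the size of the rolling SD itself
    crisis = case_when(
      std3 < 1 ~ 'no crisis',
      std3 < 2 ~ 'crisis',
      TRUE ~ 'severe crisis'
    )
  )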
#>    gdp_growth year       std3        crisis
#> 1           5 1990  0.0000000     no crisis
#> 2           4 1991  0.0000000     no crisis
#> 3           0 1992  2.6457513 severe crisis
#> 4          -4 1993  4.0000000 severe crisis
#> 5          -3 1994  2.0816660 severe crisis
#> 6          -1 1995  1.5275252        crisis
#> 7           2 1996  2.5166115 severe crisis
#> 8           4 1997  2.5166115 severe crisis
#> 9           7 1998  2.5166115 severe crisis
#> 10         10 1999  3.0000000 severe crisis
#> 11          8 2000  1.5275252        crisis
#> 12        -10 2001 11.0151411 severe crisis
#> 13         -8 2002  9.8657657 severe crisis
#> 14          2 2003  6.4291005 severe crisis
#> 15          4 2004  6.4291005 severe crisis
#> 16          5 2005  1.5275252        crisis
#> 17          8 2006  2.0816660 severe crisis
#> 18          4 2007  2.0816660 severe crisis
#> 19        -10 2008  9.4516313 severe crisis
#> 20         -9 2009  7.8102497 severe crisis
#> 21         -8 2010  1.0000000        crisis
#> 22         -3 2011  3.2145503 severe crisis
#> 23          0 2012  4.0414519 severe crisis
#> 24         -5 2013  2.5166115 severe crisis
#> 25         -6 2014  3.2145503 severe crisis
#> 26         -2 2015  2.0816660 severe crisis
#> 27          4 2016  5.0332230 severe crisis
#> 28          5 2017  3.7859389 severe crisis
#> 29          5 2018  0.5773503     no crisis
#> 30          8 2019  1.7320508        crisis
#> 31          2 2020  3.0000000 severe crisis
#> 32         -1 2021  4.5825757 severe crisis
#> 33         -1 2022  1.7320508        crisis

Created on 2022-07-11 by the reprex package (v2.0.1)

You can use lag, rowwise* and mutate from dplyr:

library(dplyr)

df |>
  # growth in each of the three preceding years
  mutate(gdp3_growth_lag1 = lag(gdp_growth, 1),
         gdp3_growth_lag2 = lag(gdp_growth, 2),
         gdp3_growth_lag3 = lag(gdp_growth, 3)) |>
  rowwise() |>
  # mean and SD across the three lag columns, row by row
  mutate(
    gdp3_growth_mean = mean(c_across(starts_with("gdp3_growth_lag"))),
    gdp3_growth_sd = sd(c_across(starts_with("gdp3_growth_lag")))
  ) |>
  ungroup() |>
  mutate(crisis = case_when(gdp_growth <= gdp3_growth_mean - 2 * gdp3_growth_sd ~ 2,
                            gdp_growth <= gdp3_growth_mean - gdp3_growth_sd ~ 1,
                            is.na(gdp3_growth_mean) ~ NA_real_,
                            TRUE ~ 0)) |>
  select(-starts_with("gdp3"))

Output:

# A tibble: 33 × 3
gdp_growth  year crisis
<dbl> <dbl>  <dbl>
1          5  1990     NA
2          4  1991     NA
3          0  1992     NA
4         -4  1993      2
5         -3  1994      0
6         -1  1995      0
7          2  1996      0
8          4  1997      0
9          7  1998      0
10         10  1999      0
11          8  2000      0
12        -10  2001      2
13         -8  2002      0
14          2  2003      0
15          4  2004      0
16          5  2005      0
17          8  2006      0
18          4  2007      0
19        -10  2008      2
20         -9  2009      1
21         -8  2010      0
22         -3  2011      0
23          0  2012      0
24         -5  2013      0
25         -6  2014      1
26         -2  2015      0
27          4  2016      0
28          5  2017      0
29          5  2018      0
30          8  2019      0
31          2  2020      2
32         -1  2021      2
33         -1  2022      0

Updated with the full output.

(*) matrixStats also has rowSds
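As a rough illustration of the row-wise idea in the footnote, the lagged values can also be arranged as a matrix so that the mean and standard deviation are computed per row without `rowwise()`. The sketch below uses only base R (`embed`, `rowMeans`, `apply`); the variable names are illustrative, and `apply(lags, 1, sd)` is the step that a row-wise SD function such as the one mentioned above could presumably replace.

```r
# Growth for 1990-2001, copied from the data above
growth <- c(5, 4, 0, -4, -3, -1, 2, 4, 7, 10, 8, -10)
k <- 3

# Each row of e holds: growth[t], growth[t-1], ..., growth[t-k], for t = k+1, ..., n
e <- embed(growth, k + 1)
cur  <- e[, 1]                    # current year's growth
lags <- e[, -1, drop = FALSE]     # growth in the k preceding years

m <- rowMeans(lags)               # trailing mean per row
s <- apply(lags, 1, sd)           # trailing SD per row

# Adding the two logical tests gives 0, 1 or 2, because passing the
# 2-SD threshold implies passing the 1-SD threshold as well.
crisis <- c(rep(NA, k), (cur <= m - s) + (cur <= m - 2 * s))
crisis
```

The first `k` entries are padded with `NA` because those years have no complete three-year history.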
