R dplyr: rolling conditional count



I have a data frame like this:

df <- data.frame(
  Item  = c("A","A","A","A","A","B","B","B","B","B"),
  Date  = c("2018-1-1","2018-2-1","2018-3-1","2018-4-1","2018-5-1",
            "2018-1-1","2018-2-1","2018-3-1","2018-4-1","2018-5-1"),
  Value = rnorm(10))

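I want to mutate a new column, grouped by Item, that counts the number of values greater than 0 within a window of 3 (or any other integer I specify).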

I'm familiar with the tidyverse, so a dplyr solution is very welcome.

If you want to roll anything, consider the zoo package.

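# Value > 0 is logical, so its rolling sum is the count of positive values.
# This runs over the whole column, so windows ignore Item (see rows 5 and 6).
df$new <- zoo::rollsum(df$Value > 0, 3, fill = NA)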
#   Item     Date      Value new
#1     A 2018-1-1  0.5852699  NA
#2     A 2018-2-1 -0.7383377   1
#3     A 2018-3-1 -0.3157693   1
#4     A 2018-4-1  1.2475237   1
#5     A 2018-5-1 -1.5479757   1
#6     B 2018-1-1 -0.6913331   0
#7     B 2018-2-1 -0.2423809   0
#8     B 2018-3-1 -1.6363024   0
#9     B 2018-4-1 -0.3256263   1
#10    B 2018-5-1  0.3563144  NA
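To make the counting explicit, here is a minimal base R sketch of what a centred window of 3 does for item A alone. The values are copied from the output above; the names `x`, `k`, `is_pos`, and `counts` are only illustrative and this is not how zoo implements it internally.

```r
# Values of item A, copied from the printed output above.
x <- c(0.5852699, -0.7383377, -0.3157693, 1.2475237, -1.5479757)
k <- 3                      # window width
is_pos <- x > 0             # TRUE FALSE FALSE TRUE FALSE

# One count per complete window: positions 1-3, 2-4 and 3-5.
counts <- vapply(
  seq_len(length(x) - k + 1),
  function(i) sum(is_pos[i:(i + k - 1)]),
  integer(1)
)

# Centre alignment: pad one NA on each side so the result lines up with x.
new <- c(NA, counts, NA)
new
# [1] NA  1  1  1 NA
```

Note that the last position is NA here, whereas row 5 in the ungrouped output above is 1, because that window reached into the first row of item B.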

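You can choose where the window sits relative to the current row. Take a close look at the argument align = c("center", "left", "right"); the default is "center".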
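As a rough illustration of the three alignments, the sketch below applies each one to a small made-up vector (the values and the variable names are assumptions for the example, not part of the question's data):

```r
x   <- c(1, 2, 3, -1, -2, 4)   # made-up values
pos <- x > 0                   # TRUE TRUE TRUE FALSE FALSE TRUE

# The four complete windows contain 3, 2, 1 and 1 positive values.
# align only decides which row each window's count is attached to.
centre <- zoo::rollsum(pos, 3, fill = NA, align = "center")  # NA  3  2  1  1 NA
right  <- zoo::rollsum(pos, 3, fill = NA, align = "right")   # NA NA  3  2  1  1
left   <- zoo::rollsum(pos, 3, fill = NA, align = "left")    #  3  2  1  1 NA NA
```

For a trailing count ("this row and the two before it"), align = "right" is probably what you want.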


So, as a dplyr chain (which also keeps the windows inside each Item):

library(dplyr)
df %>% group_by(Item) %>% mutate(new = zoo::rollsum(Value > 0, 3, fill = NA))
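Here is a small reproducible sketch of the grouped version with the window width kept in a variable. The data frame `toy`, its values, and the name `k` are made up for illustration. The point is that each Item gets its own NA padding, so no window crosses from A into B.

```r
library(dplyr)

# Made-up data: two items with five values each.
toy <- data.frame(
  Item  = rep(c("A", "B"), each = 5),
  Value = c(1, -1, -2, 3, 4,   -1, -2, -3, 4, 5)
)

k <- 3  # window width; change this to any other integer

res <- toy %>%
  group_by(Item) %>%
  mutate(new = zoo::rollsum(Value > 0, k, fill = NA)) %>%
  ungroup()

res$new
# A: NA 1 1 2 NA    B: NA 0 1 2 NA
```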

You can also use the RcppRoll package.

library(RcppRoll)
df$new <- RcppRoll::roll_sum(df$Value > 0, 3, fill = NA)

Using the tidyverse:

df %>%
  group_by(Item) %>%
  dplyr::mutate(new = RcppRoll::roll_sum(Value > 0, 3, fill = NA))

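In terms of speed, this is faster than the zoo package (roughly four times faster at the median in the benchmark below):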

library(dplyr)

n <- 10000
df <- data.frame(
  Item  = sample(LETTERS, n, replace = TRUE),
  Value = rnorm(n))
df_grouped <- df %>%
  group_by(Item)
microbenchmark::microbenchmark(
  RcppRoll = df_grouped <- df_grouped %>%
    dplyr::mutate(new_RcppRoll = RcppRoll::roll_sum(Value > 0, 3, fill = NA)),
  zoo = df_grouped <- df_grouped %>%
    dplyr::mutate(new_zoo = zoo::rollsum(Value > 0, 3, fill = NA))
)

Results:

Unit: milliseconds
     expr       min        lq      mean   median        uq       max neval
 RcppRoll  2.509003  2.741993  2.929227  2.83913  2.983726  5.832962   100
      zoo 11.172920 11.785113 13.288970 12.43320 13.607826 25.879754   100

Both approaches give the same result:

all.equal(df_grouped$new_RcppRoll, df_grouped$new_zoo)
# [1] TRUE

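For clarity, I changed the rnorm example to use sample(-5:5), which gives this data:

   Item  Date       Value
   <fct> <date>     <int>
 1 A     2018-01-01     3
 2 B     2018-01-01     2
 3 B     2018-02-01    -5
 4 A     2018-02-01    -3
 5 A     2018-03-01     4
 6 B     2018-03-01    -2
 7 A     2018-04-01     5
 8 B     2018-04-01     0
 9 A     2018-05-01     1
10 B     2018-05-01    -4

Note that the code below sums the positive values in each trailing window of 3 rather than counting them: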

df <- df %>%
  mutate(greater_than = (Value > 0) * Value) %>%   # keep positive values, zero out the rest
  group_by(Item) %>%
  arrange(Date) %>%
  mutate(greater_than = zoo::rollapplyr(greater_than, 3, sum, partial = TRUE))
df %>% arrange(Item) %>% head(10)

The result should look like this:

   Item  Date       Value greater_than
 1 A     2018-01-01     3            3
 2 A     2018-02-01    -3            3
 3 A     2018-03-01     4            7
 4 A     2018-04-01     5            9
 5 A     2018-05-01     1           10
 6 B     2018-01-01     2            2
 7 B     2018-02-01    -5            2
 8 B     2018-03-01    -2            2
 9 B     2018-04-01     0            0
10 B     2018-05-01    -4            0
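If you want the count of positive values rather than their sum (which is what the question asks for), one possible variant is to roll over the logical `Value > 0` directly. This is only a sketch: the data frame is rebuilt by hand from the table above, and the column name `n_pos` is made up.

```r
library(dplyr)

# Data rebuilt by hand from the table above, already ordered by Item and Date.
df <- data.frame(
  Item  = rep(c("A", "B"), each = 5),
  Date  = rep(as.Date(c("2018-01-01", "2018-02-01", "2018-03-01",
                        "2018-04-01", "2018-05-01")), 2),
  Value = c(3L, -3L, 4L, 5L, 1L,   2L, -5L, -2L, 0L, -4L)
)

counts <- df %>%
  group_by(Item) %>%
  arrange(Date, .by_group = TRUE) %>%
  # partial = TRUE lets the first rows of each group use shorter windows
  # instead of returning NA.
  mutate(n_pos = zoo::rollapplyr(Value > 0, 3, sum, partial = TRUE)) %>%
  ungroup()

counts$n_pos
# A: 1 1 2 2 3    B: 1 1 1 0 0
```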
