I have a data frame like this:
df <- data.frame(
Item=c("A","A","A","A","A","B","B","B","B","B"),
Date=c("2018-1-1","2018-2-1","2018-3-1","2018-4-1","2018-5-1","2018-1-1","2018-2-1",
"2018-3-1","2018-4-1","2018-5-1"),
Value=rnorm(10))
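I want to mutate a new column, grouped by Item, that counts the number of values greater than 0 within a window of 3 (or any other integer I specify).
I am familiar with the tidyverse, so a dplyr solution is very welcome.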
If you want to roll anything, consider the zoo package.
# Note: this rolls over the whole column and ignores the Item groups.
df$new <- zoo::rollsum(df$Value > 0, 3, fill = NA)
# Item Date Value new
#1 A 2018-1-1 0.5852699 NA
#2 A 2018-2-1 -0.7383377 1
#3 A 2018-3-1 -0.3157693 1
#4 A 2018-4-1 1.2475237 1
#5 A 2018-5-1 -1.5479757 1
#6 B 2018-1-1 -0.6913331 0
#7 B 2018-2-1 -0.2423809 0
#8 B 2018-3-1 -1.6363024 0
#9 B 2018-4-1 -0.3256263 1
#10 B 2018-5-1 0.3563144 NA
You can choose the position of the window. Take a closer look at the argument align = c("center", "left", "right").
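To make the effect of align concrete, here is a minimal sketch on a short vector. The values in x are hypothetical (they are not the question's random data), chosen so each window can be checked by hand.

```r
library(zoo)

# Hypothetical values, not taken from the question's data.
x <- c(1, -1, -1, 1, -1)

# Count of positive values in each window of width 3.
# Only the position of the NA padding changes with `align`.
centered <- rollsum(x > 0, 3, fill = NA, align = "center")  # NA  1  1  1 NA
left     <- rollsum(x > 0, 3, fill = NA, align = "left")    #  1  1  1 NA NA
right    <- rollsum(x > 0, 3, fill = NA, align = "right")   # NA NA  1  1  1
```

With align = "right", each result describes the current row and the two rows before it, which is usually what a trailing count over dates calls for.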
So, as a dplyr chain:
df %>% group_by(Item) %>% dplyr::mutate( new = zoo::rollsum( Value > 0, 3, fill = NA ))
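The grouped chain restarts the window inside each Item, so the NA padding appears at the edges of every group rather than only at the ends of the column. A small sketch of this, assuming a hypothetical data frame toy and a window width stored in k (both names are introduced here for illustration):

```r
library(dplyr)
library(zoo)

# Hypothetical values chosen so the result is easy to verify by hand.
toy <- data.frame(
  Item  = rep(c("A", "B"), each = 5),
  Value = c(1, -1, -1, 1, -1,   -1, -1, -1, -1, 1)
)

k <- 3  # window width; any positive integer no larger than the group size

toy_counted <- toy %>%
  group_by(Item) %>%
  mutate(new = rollsum(Value > 0, k, fill = NA)) %>%  # centered window by default
  ungroup()

toy_counted$new
# A: NA 1 1 1 NA    B: NA 0 0 1 NA
```

Compare row 5 and row 6 with the ungrouped output earlier in the thread: there they were computed from windows that straddled A and B, while here both are NA.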
You can use the RcppRoll package.
library(RcppRoll)
df$new <- RcppRoll::roll_sum(df$Value > 0, 3, fill = NA)
Using the tidyverse:
df %>%
group_by(Item) %>%
dplyr::mutate(new = RcppRoll::roll_sum(Value > 0, 3, fill = NA))
In terms of speed, this is faster than the zoo package in the following benchmark:
n <- 10000
df <- data.frame(
Item = sample(LETTERS, n, replace = TRUE),
Value = rnorm(n))
df_grouped <- df %>%
group_by(Item)
microbenchmark::microbenchmark(
RcppRoll = df_grouped <- df_grouped %>% dplyr::mutate(new_RcppRoll = RcppRoll::roll_sum(Value > 0, 3, fill = NA)),
zoo = df_grouped <- df_grouped %>% dplyr::mutate(new_zoo = zoo::rollsum( Value > 0, 3, fill = NA ))
)
Results:
Unit: milliseconds
expr min lq mean median uq max neval
RcppRoll 2.509003 2.741993 2.929227 2.83913 2.983726 5.832962 100
zoo 11.172920 11.785113 13.288970 12.43320 13.607826 25.879754 100
and
all.equal(df_grouped$new_RcppRoll, df_grouped$new_zoo)
TRUE
Item Date Value
<fct> <date> <int>
1 A 2018-01-01 3
2 B 2018-01-01 2
3 B 2018-02-01 -5
4 A 2018-02-01 -3
5 A 2018-03-01 4
6 B 2018-03-01 -2
7 A 2018-04-01 5
8 B 2018-04-01 0
9 A 2018-05-01 1
10 B 2018-05-01 -4
For clarity, I changed the rnorm example to use sample(-5:5):
df <- df %>%
  mutate(greater_than = (Value > 0) * Value) %>%  # keep positive values, zero out the rest
  group_by(Item) %>%
  arrange(Date) %>%
  mutate(greater_than = zoo::rollapplyr(greater_than, 3, sum, partial = TRUE))
df %>% arrange(Item) %>% head(10)
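The result should look like this (note that the last column is the sum of the positive values in each window, not their count):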
1 A 2018-01-01 3 3
2 A 2018-02-01 -3 3
3 A 2018-03-01 4 7
4 A 2018-04-01 5 9
5 A 2018-05-01 1 10
6 B 2018-01-01 2 2
7 B 2018-02-01 -5 2
8 B 2018-03-01 -2 2
9 B 2018-04-01 0 0
10 B 2018-05-01 -4 0
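If a count of positive values is wanted instead of their sum, the same rollapplyr call can be applied to the logical vector Value > 0. The sketch below retypes the sample(-5:5) values shown above by hand; the names k, counted and n_positive are hypothetical, introduced only for this illustration.

```r
library(dplyr)
library(zoo)

# Same values as the sample(-5:5) example above, typed in by hand.
df <- data.frame(
  Item  = c("A", "B", "B", "A", "A", "B", "A", "B", "A", "B"),
  Date  = as.Date(c("2018-01-01", "2018-01-01", "2018-02-01", "2018-02-01",
                    "2018-03-01", "2018-03-01", "2018-04-01", "2018-04-01",
                    "2018-05-01", "2018-05-01")),
  Value = c(3L, 2L, -5L, -3L, 4L, -2L, 5L, 0L, 1L, -4L)
)

k <- 3  # window width; any positive integer

counted <- df %>%
  arrange(Item, Date) %>%
  group_by(Item) %>%
  # Right-aligned window; partial = TRUE lets the first k - 1 rows of each
  # group use a shorter window instead of returning NA.
  mutate(n_positive = rollapplyr(Value > 0, k, sum, partial = TRUE)) %>%
  ungroup()

counted$n_positive
# A: 1 1 2 2 3    B: 1 1 1 0 0
```

Note that 0 is not counted as positive here, which matches the "greater than 0" wording of the question.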