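I have a data frame with a column named Product (containing many products), a column named Timestamp (dates encoded as a discrete ordinal variable), and a column named Rating.
I am trying to compute a moving average and a moving standard deviation of Rating for each product, taking the timestamp into account.
The data looks like this: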
DF <- data.frame(Product=c("a","a","a","a","b","b","b","c","c","c","c","c"),
Timestamp=c(1,2,3,4,1,2,3,1,2,3,4,5),
Rating=c(4,3,5,3,3,4,5,3,1,1,2,5))
Now I add columns for the moving average and the moving standard deviation:
DF$Moving.avg <- rep(0,nrow(DF))
DF$Moving.sd <- rep(0,nrow(DF))
Finally, I use the following nested for loops to get the result I want:
for (product in unique(DF$Product)) {
  for (timestamp in DF[DF$Product==product,]$Timestamp){
    if (timestamp==1) {
      DF[DF$Product==product &
           DF$Timestamp==timestamp,]$Moving.avg <-
        DF[DF$Product==product &
             DF$Timestamp==timestamp,]$Rating
      DF[DF$Product==product &
           DF$Timestamp==timestamp,]$Moving.sd <- 0
    }else{
      index_start <- which(DF$Product==product &
                             DF$Timestamp==1)
      index_end <- which(DF$Product==product &
                           DF$Timestamp==timestamp)
      DF[DF$Product==product &
           DF$Timestamp==timestamp,]$Moving.avg <-
        mean(DF[index_start:index_end,]$Rating)
      DF[DF$Product==product &
           DF$Timestamp==timestamp,]$Moving.sd <-
        sd(DF[index_start:index_end,]$Rating)
    }
  }
}
The code works, but it is too slow. How can I use vectorization to speed it up?
If you want to vectorize this in base R, you could try:
DF <- data.frame(Product=c("a","a","a","a","b","b","b","c","c","c","c","c"),
Timestamp=c(1,2,3,4,1,2,3,1,2,3,4,5),
Rating=c(4,3,5,3,3,4,5,3,1,1,2,5))
cbind(DF, do.call(rbind, lapply(split(DF, DF$Product), function(x) {
do.call(rbind, lapply(seq(nrow(x)), function(y) {
c(Moving.avg = mean(x$Rating[1:y]), Moving.sd = sd(x$Rating[1:y]))}))})))
#> Product Timestamp Rating Moving.avg Moving.sd
#> 1 a 1 4 4.000000 NA
#> 2 a 2 3 3.500000 0.7071068
#> 3 a 3 5 4.000000 1.0000000
#> 4 a 4 3 3.750000 0.9574271
#> 5 b 1 3 3.000000 NA
#> 6 b 2 4 3.500000 0.7071068
#> 7 b 3 5 4.000000 1.0000000
#> 8 c 1 3 3.000000 NA
#> 9 c 2 1 2.000000 1.4142136
#> 10 c 3 1 1.666667 1.1547005
#> 11 c 4 2 1.750000 0.9574271
#> 12 c 5 5 2.400000 1.6733201
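Note, however, that the sd of a single number is NA rather than 0. If you need zeros, assign the result back to DF and then replace the NA values with DF$Moving.sd[is.na(DF$Moving.sd)] <- 0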
Created on 2020-08-31 by the reprex package (v0.3.0)
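The lapply approach above still calls mean() and sd() once per row. As a rough sketch of a more fully vectorized base R alternative, the cumulative statistics can be derived from running sums with ave() and cumsum(). The helper names n, s, ss and v are only illustrative, and the sum-of-squares formula for the variance is an assumption of this sketch that can lose precision on large or nearly constant values.

```r
DF <- data.frame(Product = c("a","a","a","a","b","b","b","c","c","c","c","c"),
                 Timestamp = c(1,2,3,4,1,2,3,1,2,3,4,5),
                 Rating = c(4,3,5,3,3,4,5,3,1,1,2,5))

# Sort so that each product's rows are in timestamp order
DF <- DF[order(DF$Product, DF$Timestamp), ]

# Running count, running sum and running sum of squares within each product
n  <- ave(DF$Rating, DF$Product, FUN = seq_along)
s  <- ave(DF$Rating, DF$Product, FUN = cumsum)
ss <- ave(DF$Rating^2, DF$Product, FUN = cumsum)

# Cumulative mean
DF$Moving.avg <- s / n

# Cumulative sample variance from the running sums;
# pmax() guards against tiny negative values caused by rounding
v <- pmax(ss - s^2 / n, 0) / (n - 1)

# The first row of each product has no spread, so use 0 as in the question
DF$Moving.sd <- ifelse(n == 1, 0, sqrt(v))

DF
```

Because everything is expressed with cumsum(), each group is processed in a single pass instead of recomputing the statistics over a growing slice for every row.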
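I think you are looking for a cumulative mean and a cumulative standard deviation.
For the cumulative mean you can use the cummean function from dplyr, and TTR::runSD for the cumulative standard deviation.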
library(dplyr)
DF %>%
group_by(Product) %>%
mutate(cum_avg = cummean(Rating),
cum_std = TTR::runSD(Rating, n = 1, cumulative = TRUE))
# Product Timestamp Rating cum_avg cum_std
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 4 4 NaN
# 2 a 2 3 3.5 0.707
# 3 a 3 5 4 1
# 4 a 4 3 3.75 0.957
# 5 b 1 3 3 NaN
# 6 b 2 4 3.5 0.707
# 7 b 3 5 4 1
# 8 c 1 3 3 NaN
# 9 c 2 1 2 1.41
#10 c 3 1 1.67 1.15
#11 c 4 2 1.75 0.957
#12 c 5 5 2.4 1.67
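Does this example work for you? Here I use the runner() function from the runner package. runner() applies a function over a window that you define, and it works well with group_by() from dplyr. The window size is set with the k argument.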
library(runner)
library(dplyr)
library(magrittr)
DF <- data.frame(Product=c("a","a","a","a","b","b","b","c","c","c","c","c"),
Timestamp=c(1,2,3,4,1,2,3,1,2,3,4,5),
Rating=c(4,3,5,3,3,4,5,3,1,1,2,5))
DF <- DF %>%
group_by(Product) %>%
arrange(Timestamp, .by_group = TRUE)
DF <- DF %>%
mutate(
average = runner(Rating, f = function(x) mean(x), k = 3),
deviation = runner(Rating, f = function(x) sd(x), k = 3)
)
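It is worth mentioning that, on the first rows of each group (that is, each Product) in the data.frame, the function grows the window until it reaches the size defined by k. So in the first two rows, where three previous values are not yet available, runner() still applies the function to the values that do exist.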
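To make the growing-window behaviour described above concrete, here is a minimal base R sketch. rolling_apply is a hypothetical helper written for illustration only; it is not part of the runner package and only mimics the idea of a trailing window of at most k elements.

```r
# Hypothetical helper (not from runner): apply f over a trailing window of
# at most k elements, using a shorter window at the start of the vector
rolling_apply <- function(x, f, k) {
  vapply(seq_along(x),
         function(i) f(x[max(1, i - k + 1):i]),
         numeric(1))
}

rating_c <- c(3, 1, 1, 2, 5)  # ratings of product "c" from the question

# Rows 1 and 2 use windows of 1 and 2 values, later rows use 3 values
rolling_apply(rating_c, mean, k = 3)
rolling_apply(rating_c, sd, k = 3)
```

With k = 3 the later rows only look back three observations, so the results differ from the cumulative statistics in the question once a product has more than three rows.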
Building on the answer to this related question, you can also do this with dplyr:
library(dplyr)
library(purrr)  # provides map_dbl
DF <- DF %>%
# Sort in order of product and then timestamp within product
arrange(Product, Timestamp) %>%
# group data by product
group_by(Product) %>%
# use the cumulative mean function to calculate the means
mutate(Moving.avg = cummean(Rating),
# use the map_dbl function to calculate standard deviations up to a certain index value
Moving.sd = map_dbl(seq_along(Timestamp),~sd(Rating[1:.x])),
# replace Moving.sd=0 when Timestamp takes on its smallest value
Moving.sd = case_when(Timestamp == min(Timestamp) ~ 0,
TRUE ~ Moving.sd)) %>%
# ungroup the data
ungroup()