r - How to vectorize a nested loop and update a data frame



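I have a data frame with a column named Product (containing many products), a column named Timestamp (dates encoded as a discrete ordinal variable), and a column named Rating.
I am trying to compute the moving average and moving standard deviation of the Rating variable for each product, taking the timestamp into account.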

The data looks like this:

DF <- data.frame(Product=c("a","a","a","a","b","b","b","c","c","c","c","c"),
Timestamp=c(1,2,3,4,1,2,3,1,2,3,4,5),
Rating=c(4,3,5,3,3,4,5,3,1,1,2,5))

Now I add columns for the moving average and moving standard deviation:

DF$Moving.avg <- rep(0,nrow(DF))
DF$Moving.sd <- rep(0,nrow(DF))

Finally, I use this nested for-loop code to get the result I want:

for (product in unique(DF$Product)) {
  for (timestamp in DF[DF$Product==product,]$Timestamp){
    if (timestamp==1) {
      DF[DF$Product==product &
           DF$Timestamp==timestamp,]$Moving.avg <- 
        DF[DF$Product==product &
             DF$Timestamp==timestamp,]$Rating
      DF[DF$Product==product &
           DF$Timestamp==timestamp,]$Moving.sd <- 0
    }else{
      index_start <- which(DF$Product==product &
                             DF$Timestamp==1)
      index_end <- which(DF$Product==product &
                           DF$Timestamp==timestamp)
      DF[DF$Product==product &
           DF$Timestamp==timestamp,]$Moving.avg <- 
        mean(DF[index_start:index_end,]$Rating)

      DF[DF$Product==product &
           DF$Timestamp==timestamp,]$Moving.sd <- 
        sd(DF[index_start:index_end,]$Rating)
    }
  }
}

The code works fine, but it is too slow. How can I use vectorization to speed it up?

If you want to vectorize this in base R, you could try:

DF <- data.frame(Product=c("a","a","a","a","b","b","b","c","c","c","c","c"),
Timestamp=c(1,2,3,4,1,2,3,1,2,3,4,5),
Rating=c(4,3,5,3,3,4,5,3,1,1,2,5))
cbind(DF, do.call(rbind, lapply(split(DF, DF$Product), function(x) {
do.call(rbind, lapply(seq(nrow(x)), function(y) {
c(Moving.avg = mean(x$Rating[1:y]), Moving.sd = sd(x$Rating[1:y]))}))})))
#>    Product Timestamp Rating Moving.avg Moving.sd
#> 1        a         1      4   4.000000        NA
#> 2        a         2      3   3.500000 0.7071068
#> 3        a         3      5   4.000000 1.0000000
#> 4        a         4      3   3.750000 0.9574271
#> 5        b         1      3   3.000000        NA
#> 6        b         2      4   3.500000 0.7071068
#> 7        b         3      5   4.000000 1.0000000
#> 8        c         1      3   3.000000        NA
#> 9        c         2      1   2.000000 1.4142136
#> 10       c         3      1   1.666667 1.1547005
#> 11       c         4      2   1.750000 0.9574271
#> 12       c         5      5   2.400000 1.6733201

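Note, however, that the sd of a single number is NA rather than 0. If needed, you can simply replace these, once the result has been assigned back to DF, with DF$Moving.sd[is.na(DF$Moving.sd)] <- 0.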
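The split/lapply approach above still recomputes mean() and sd() from scratch for every row. As a rough sketch of a more fully vectorized alternative in base R, the cumulative mean and standard deviation can be derived from running sums with cumsum() and applied per product with ave(). The helper names cum_mean and cum_sd are hypothetical (they are not base R functions), and the sum-of-squares formula used here can lose precision when ratings are very large or nearly constant, so treat it as an illustration rather than a drop-in replacement.

```r
DF <- data.frame(Product = c("a","a","a","a","b","b","b","c","c","c","c","c"),
                 Timestamp = c(1,2,3,4,1,2,3,1,2,3,4,5),
                 Rating = c(4,3,5,3,3,4,5,3,1,1,2,5))

# Hypothetical helper: cumulative mean from a running sum
cum_mean <- function(x) cumsum(x) / seq_along(x)

# Hypothetical helper: cumulative sample sd from running sums of x and x^2
cum_sd <- function(x) {
  n <- seq_along(x)
  m <- cumsum(x) / n
  v <- (cumsum(x^2) - n * m^2) / (n - 1)  # n = 1 yields 0/0 = NaN
  v[n == 1] <- 0                          # the question wants 0 for the first row
  sqrt(pmax(v, 0))                        # guard against tiny negative rounding errors
}

# Make sure rows are in timestamp order within each product
DF <- DF[order(DF$Product, DF$Timestamp), ]

# ave() applies the function within each Product and keeps the row order
DF$Moving.avg <- ave(DF$Rating, DF$Product, FUN = cum_mean)
DF$Moving.sd  <- ave(DF$Rating, DF$Product, FUN = cum_sd)
```

Unlike the output shown above, this sketch puts 0 rather than NA in the first row of each product, which matches what the loop in the question produces.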

Created on 2020-08-31 by the reprex package (v0.3.0)

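I think you are looking for the cumulative mean and cumulative standard deviation.

For the cumulative mean you can use dplyr's cummean function, and TTR::runSD for the cumulative standard deviation.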

library(dplyr)
DF %>%
  group_by(Product) %>%
  mutate(cum_avg = cummean(Rating), 
         cum_std = TTR::runSD(Rating, n = 1, cumulative = TRUE))
#  Product Timestamp Rating cum_avg cum_std
#   <chr>       <dbl>  <dbl>   <dbl>   <dbl>
# 1 a               1      4    4    NaN    
# 2 a               2      3    3.5    0.707
# 3 a               3      5    4      1    
# 4 a               4      3    3.75   0.957
# 5 b               1      3    3    NaN    
# 6 b               2      4    3.5    0.707
# 7 b               3      5    4      1    
# 8 c               1      3    3    NaN    
# 9 c               2      1    2      1.41 
#10 c               3      1    1.67   1.15 
#11 c               4      2    1.75   0.957
#12 c               5      5    2.4    1.67 

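Does this example work for you? Here I use the runner() function from the runner package. runner() applies a function over a window that you define, and it works well with group_by() from dplyr. You set the window size through the k argument.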

library(runner)
library(dplyr)
library(magrittr)
DF <- data.frame(Product=c("a","a","a","a","b","b","b","c","c","c","c","c"),
Timestamp=c(1,2,3,4,1,2,3,1,2,3,4,5),
Rating=c(4,3,5,3,3,4,5,3,1,1,2,5))

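# Sort by timestamp within each product so the windows follow time order
DF <- DF %>% 
  group_by(Product) %>% 
  arrange(Timestamp, .by_group = TRUE)

# Apply mean() and sd() over a window of up to 3 rows within each product
DF <- DF %>% 
  mutate(
    average = runner(Rating, f = function(x) mean(x), k = 3),
    deviation = runner(Rating, f = function(x) sd(x), k = 3)
  )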

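It is worth mentioning that the function grows the window over the first rows of each group (that is, each Product) in the data.frame until it reaches the size defined by the k argument. So in the first two rows, where there are not yet 3 values available, runner() still applies the function to the values it has.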
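To make the window behaviour described above concrete, here is a small base R sketch that emulates it for the ratings of product "c". It is not runner's implementation, and window_at is a hypothetical helper introduced only for illustration. It also shows that a fixed window of k = 3 matches the cumulative statistics asked for in the question only for the first three rows of a group.

```r
# Ratings of product "c" from the example data, already sorted by Timestamp
x <- c(3, 1, 1, 2, 5)
k <- 3  # window size, as in the runner() call above

# Hypothetical helper: for row i, take at most the last k values ending at i;
# near the start of the group the window is simply shorter
window_at <- function(i) x[max(1, i - k + 1):i]

win_mean <- sapply(seq_along(x), function(i) mean(window_at(i)))
win_sd   <- sapply(seq_along(x), function(i) sd(window_at(i)))

# Cumulative mean over all rows so far, for comparison
cum_mean <- cumsum(x) / seq_along(x)

print(round(win_mean, 3))  # windowed means
print(round(cum_mean, 3))  # cumulative means
```

From the fourth row on, the windowed mean drops the oldest rating while the cumulative mean keeps it, so the two series diverge. If the cumulative values are what you need, the window has to cover all earlier rows of the group.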

Building on the answer to a related question, you can also do this with dplyr (plus map_dbl from purrr):

library(dplyr)
library(purrr)  # provides map_dbl

DF <- DF %>% 
  # Sort in order of product and then timestamp within product 
  arrange(Product, Timestamp) %>% 
  # group data by product
  group_by(Product) %>% 
  # use the cumulative mean function to calculate the means 
  mutate(Moving.avg = cummean(Rating), 
         # use the map_dbl function to calculate standard deviations up to a certain index value       
         Moving.sd = map_dbl(seq_along(Timestamp), ~sd(Rating[1:.x])), 
         # replace Moving.sd=0 when Timestamp takes on its smallest value
         Moving.sd = case_when(Timestamp == min(Timestamp) ~ 0, 
                               TRUE ~ Moving.sd)) %>%
  # ungroup the data
  ungroup()
