加权累算与另一列滚动

  • 本文关键字:一列 滚动 加权 r
  • 更新时间 :
  • 英文 :


有一个数据帧,看起来像这样:

library(dplyr)
test <- data.frame("name" = c("Scott","Scott","Scott","Scott","Scott","Scott"),
"minutes" = c(100, 50, 150, 200, 100, 250),
"grade" = c(2, 1.5, 2.5, 3, 2.2, 2.8))

我想用cumsum为每一行做一个加权分数,它显示了他们用分钟数加权的平均分数——然而,我希望样本只包括占400分钟的最近一行。

以下是使用cumsum:的weighted_grade代码示例

test <- test %>%
mutate(weighted_grade = cumsum(grade*minutes)/cumsum(minutes))

这对整个样本来说是一个很好的加权分数,但我只寻找占400分钟的最近一行。我研究了滚动和,但这些是基于行数而不是小时数。

为了清楚起见,我希望新列的前3行返回NA(因为前3行加起来有300分钟,因此不相关(;第4行将返回第2、3和4行的weighted_grade(总共400分钟,因此第1行不相关(;第5行将返回第3、4和5行的weighted_grade(450分钟(;等等…

1(rollapplyr按名称分组,然后对每个名称使用rollapplyr。注意,宽度可以是我们使用findInterval设置的向量。

library(dplyr, exclude = c("filter", "lag"))
library(zoo)
test %>%
group_by(name) %>%
mutate(
minutes0 = ifelse(is.na(minutes), 0, minutes),
cumsum = cumsum(minutes0),
mean = rollapplyr(1:n(),
width = 1:n() - findInterval(cumsum - 400, cumsum),
FUN = function(ix) if (sum(minutes0[ix]) < 400) NA
else weighted.mean(grade[ix], minutes0[ix]),
fill = NA)) %>%
ungroup %>%
select(name, minutes, grade, mean)

给予:

# A tibble: 6 x 4
name  minutes grade  mean
<chr>   <dbl> <dbl> <dbl>
1 Scott     100   2   NA   
2 Scott      50   1.5 NA   
3 Scott     150   2.5 NA   
4 Scott     200   3    2.62
5 Scott     100   2.2  2.66
6 Scott     250   2.8  2.76

2(sqldf使用sql的方法是:

library(sqldf)
sqldf("with t1 as (
select rowid id, *, sum(minutes) over (partition by name rows unbounded preceding) as cum from test
)   
select 
a.name, 
a.minutes, 
a.grade, 
iif (sum(b.minutes) < 400, Null, sum(b.grade * b.minutes) / sum(b.minutes)) as mean
from t1 a 
left join t1 b on b.cum > a.cum  - 400 and b.cum <= a.cum and a.name = b.name
group by a.id")

给予:

name minutes grade     mean
1 Scott     100   2.0       NA
2 Scott      50   1.5       NA
3 Scott     150   2.5       NA
4 Scott     200   3.0 2.625000
5 Scott     100   2.2 2.655556
6 Scott     250   2.8 2.763636

更新

轻微的代码改进。

如果我正确理解

library(tidyverse)
library(zoo)
#> 
#> 
#>     as.Date, as.Date.numeric
test <- data.frame("name" = c("Scott","Scott","Scott","Scott","Scott","Scott"),
"minutes" = c(100, 50, 150, 200, 100, 250),
"grade" = c(2, 1.5, 2.5, 3, 2.2, 2.8))
test %>%
mutate(numerator = grade * minutes,
cs_numerator = rollapply(numerator,
width = 3,
FUN = sum,
partial = T,
align = "right"),
cs_denominator = rollapply(minutes,
width = 3,
FUN = sum,
partial = T,
align = "right"),
res = ifelse(cs_denominator >= 400, cs_numerator / cs_denominator, NA))
#>    name minutes grade numerator cs_numerator cs_denominator      res
#> 1 Scott     100   2.0       200          200            100       NA
#> 2 Scott      50   1.5        75          275            150       NA
#> 3 Scott     150   2.5       375          650            300       NA
#> 4 Scott     200   3.0       600         1050            400 2.625000
#> 5 Scott     100   2.2       220         1195            450 2.655556
#> 6 Scott     250   2.8       700         1520            550 2.763636

由reprex包(v0.3.0(创建于2020-11-30

最新更新