有一个数据帧,看起来像这样:
library(dplyr)
test <- data.frame("name" = c("Scott","Scott","Scott","Scott","Scott","Scott"),
"minutes" = c(100, 50, 150, 200, 100, 250),
"grade" = c(2, 1.5, 2.5, 3, 2.2, 2.8))
我想用cumsum为每一行做一个加权分数,它显示了他们用分钟数加权的平均分数——然而,我希望样本只包括占400分钟的最近一行。
以下是使用cumsum:的weighted_grade
代码示例
test <- test %>%
mutate(weighted_grade = cumsum(grade*minutes)/cumsum(minutes))
这对整个样本来说是一个很好的加权分数,但我只寻找占400分钟的最近一行。我研究了滚动和,但这些是基于行数而不是小时数。
为了清楚起见,我希望新列的前3行返回NA(因为前3行加起来有300分钟,因此不相关(;第4行将返回第2、3和4行的weighted_grade
(总共400分钟,因此第1行不相关(;第5行将返回第3、4和5行的weighted_grade
(450分钟(;等等…
1(rollapplyr按名称分组,然后对每个名称使用rollapplyr
。注意,宽度可以是我们使用findInterval
设置的向量。
library(dplyr, exclude = c("filter", "lag"))
library(zoo)
test %>%
group_by(name) %>%
mutate(
minutes0 = ifelse(is.na(minutes), 0, minutes),
cumsum = cumsum(minutes0),
mean = rollapplyr(1:n(),
width = 1:n() - findInterval(cumsum - 400, cumsum),
FUN = function(ix) if (sum(minutes0[ix]) < 400) NA
else weighted.mean(grade[ix], minutes0[ix]),
fill = NA)) %>%
ungroup %>%
select(name, minutes, grade, mean)
给予:
# A tibble: 6 x 4
name minutes grade mean
<chr> <dbl> <dbl> <dbl>
1 Scott 100 2 NA
2 Scott 50 1.5 NA
3 Scott 150 2.5 NA
4 Scott 200 3 2.62
5 Scott 100 2.2 2.66
6 Scott 250 2.8 2.76
2(sqldf使用sql的方法是:
library(sqldf)
sqldf("with t1 as (
select rowid id, *, sum(minutes) over (partition by name rows unbounded preceding) as cum from test
)
select
a.name,
a.minutes,
a.grade,
iif (sum(b.minutes) < 400, Null, sum(b.grade * b.minutes) / sum(b.minutes)) as mean
from t1 a
left join t1 b on b.cum > a.cum - 400 and b.cum <= a.cum and a.name = b.name
group by a.id")
给予:
name minutes grade mean
1 Scott 100 2.0 NA
2 Scott 50 1.5 NA
3 Scott 150 2.5 NA
4 Scott 200 3.0 2.625000
5 Scott 100 2.2 2.655556
6 Scott 250 2.8 2.763636
更新
轻微的代码改进。
如果我正确理解
library(tidyverse)
library(zoo)
#>
#>
#> as.Date, as.Date.numeric
test <- data.frame("name" = c("Scott","Scott","Scott","Scott","Scott","Scott"),
"minutes" = c(100, 50, 150, 200, 100, 250),
"grade" = c(2, 1.5, 2.5, 3, 2.2, 2.8))
test %>%
mutate(numerator = grade * minutes,
cs_numerator = rollapply(numerator,
width = 3,
FUN = sum,
partial = T,
align = "right"),
cs_denominator = rollapply(minutes,
width = 3,
FUN = sum,
partial = T,
align = "right"),
res = ifelse(cs_denominator >= 400, cs_numerator / cs_denominator, NA))
#> name minutes grade numerator cs_numerator cs_denominator res
#> 1 Scott 100 2.0 200 200 100 NA
#> 2 Scott 50 1.5 75 275 150 NA
#> 3 Scott 150 2.5 375 650 300 NA
#> 4 Scott 200 3.0 600 1050 400 2.625000
#> 5 Scott 100 2.2 220 1195 450 2.655556
#> 6 Scott 250 2.8 700 1520 550 2.763636
由reprex包(v0.3.0(创建于2020-11-30