I'm having trouble figuring out how to do the following in R. Consider this dataset:
library(dplyr)  # re-exports tibble() and provides the verbs used below

pdata <- tibble(
  id = rep(1:10, each = 5),
  time = rep(2016:2020, times = 10),
  value = c(c(1,1,1,0,0), c(1,1,0,0,0), c(0,0,1,0,0), c(0,0,0,0,0), c(1,0,0,0,1), c(0,1,1,1,0), c(0,1,1,1,1), c(1,1,1,1,1), c(1,0,1,1,1), c(1,1,0,1,1))
)
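Basically, I want to add a calculated column that looks at the id in each row and sums value over that id's rows with an earlier time. For example, in row 3, id 1 has two records older than 2018, so they add up to 2, and the new column should be 2 for row 3. Below is an example of what I need.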
# A tibble: 50 × 4
      id  time value OUTPUT
   <int> <int> <dbl>  <dbl>
 1     1  2016     1      0
 2     1  2017     1      1
 3     1  2018     1      2
 4     1  2019     0      3
 5     1  2020     0      3
 6     2  2016     1      0
 7     2  2017     1      1
 8     2  2018     0      2
 9     2  2019     0      2
10     2  2020     0      2
# … with 40 more rows
Thanks!
Sort by id and time, group by id, and compute the cumulative sum of the lagged value.
suppressPackageStartupMessages(library(dplyr))
pdata <- tibble(
id = rep(1:10, each = 5),
time = rep(2016:2020, times = 10),
value = c(c(1,1,1,0,0), c(1,1,0,0,0), c(0,0,1,0,0), c(0,0,0,0,0), c(1,0,0,0,1), c(0,1,1,1,0), c(0,1,1,1,1), c(1,1,1,1,1), c(1,0,1,1,1), c(1,1,0,1,1))
)
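pdata %>%
  arrange(id, time) %>%   # make sure rows are in time order within each id
  group_by(id) %>%        # restart the running total for every id
  # lag() shifts value down one row, so cumsum() only sees earlier rows
  mutate(OUTPUT = cumsum(lag(value, default = 0))) %>%
  ungroup()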
#> # A tibble: 50 × 4
#>       id  time value OUTPUT
#>    <int> <int> <dbl>  <dbl>
#>  1     1  2016     1      0
#>  2     1  2017     1      1
#>  3     1  2018     1      2
#>  4     1  2019     0      3
#>  5     1  2020     0      3
#>  6     2  2016     1      0
#>  7     2  2017     1      1
#>  8     2  2018     0      2
#>  9     2  2019     0      2
#> 10     2  2020     0      2
#> # … with 40 more rows
Created on 2022-09-11 by the reprex package (v2.0.1)
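To see why lagging before accumulating gives the sum of strictly earlier rows, here is a minimal base R sketch on the values for id 1. The helper name prior_sum is hypothetical (it is not a dplyr function), and the sketch assumes the vector is non-empty and already in time order.

```r
# Values for id 1, ordered by time (2016..2020)
v <- c(1, 1, 1, 0, 0)

# Hypothetical helper: sum of all elements strictly before each position.
# Shifting right by one (filling with 0) and then accumulating keeps the
# current row out of its own total. Assumes length(x) >= 1.
prior_sum <- function(x) cumsum(c(0, head(x, -1)))

prior_sum(v)
#> [1] 0 1 2 3 3

# Equivalent formulation: running total minus the current value
all.equal(prior_sum(v), cumsum(v) - v)
#> [1] TRUE
```

The shifted vector plays the same role as lag(value, default = 0) in the dplyr pipeline; grouping by id simply applies this to each id separately.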
library(tidyverse)
df <- data.frame(
id = rep(1:10, each = 5),
time = rep(2016:2020, times = 10),
value = c(c(1,1,1,0,0), c(1,1,0,0,0), c(0,0,1,0,0), c(0,0,0,0,0), c(1,0,0,0,1), c(0,1,1,1,0), c(0,1,1,1,1), c(1,1,1,1,1), c(1,0,1,1,1), c(1,1,0,1,1))
)
df1 <- df %>%
  group_by(id) %>%
  # Lag value itself so each row accumulates only the earlier rows of its id
  mutate(output = cumsum(lag(value, default = 0)))
Console:
      id  time value output
   <int> <int> <dbl>  <dbl>
 1     1  2016     1      0
 2     1  2017     1      1
 3     1  2018     1      2
 4     1  2019     0      3
 5     1  2020     0      3
 6     2  2016     1      0
 7     2  2017     1      1
 8     2  2018     0      2
 9     2  2019     0      2
10     2  2020     0      2
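A minimal base R sketch for checking the lag-and-accumulate approach against the requirement as stated in the question: sum value over rows with the same id and a strictly earlier time. It uses only the first two ids from the question's data, and the column names check and output are chosen for illustration. The direct version is quadratic in the number of rows, so it is meant as a sanity check rather than as a replacement.

```r
# First two ids from the question, as a plain data frame
df <- data.frame(
  id    = rep(1:2, each = 5),
  time  = rep(2016:2020, times = 2),
  value = c(1, 1, 1, 0, 0,  1, 1, 0, 0, 0)
)

# Direct translation of the requirement: for each row, add up value
# over rows with the same id and a strictly earlier time.
df$check <- vapply(
  seq_len(nrow(df)),
  function(i) sum(df$value[df$id == df$id[i] & df$time < df$time[i]]),
  numeric(1)
)

# Lag-and-accumulate version: sort, then take the running total per id
# minus the current value.
df <- df[order(df$id, df$time), ]
df$output <- ave(df$value, df$id, FUN = function(v) cumsum(v) - v)

# Both columns should agree on every row
stopifnot(all(df$check == df$output))
df$output
#> [1] 0 1 2 3 3 0 1 2 2 2
```

The two versions agree only when each id has at most one row per time value; with duplicate times, the direct version excludes rows from the same year while the running total does not.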