I have data like this:
id <- c(rep(1,5), rep(2,5), rep(3,4), rep(4,2), rep(5, 1))
year <- c(1990,1991,1992,1993,1994,1990,1991,1992,1993,1994,1990,1991,1992,1994,1990,1994, 1994)
gender <- c(rep("female", 5), rep("male", 5), rep("male", 4), rep("female", 2), rep("male", 1))
dat <- data.frame(id,year,gender)
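As you can see, ids 1 and 2 have an observation for every year from 1990 to 1994, whereas ids 3 and 4 are missing some observations between 1990 and 1994, and id 5 has only a single observation.
What I want to do is copy the id and gender columns and insert the missing observations for ids 3 and 4, so that they have an observation for every year from 1990 to 1994, while leaving ids 1, 2, and 5 untouched. Is there a way to create a sequence of numbers from the oldest to the newest observation, grouped by a variable such as id, on the condition that there is a gap between two numbers within the group?
The end result should look like this: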
id year gender
<dbl> <dbl> <chr>
1 1 1990 female
2 1 1991 female
3 1 1992 female
4 1 1993 female
5 1 1994 female
6 2 1990 male
7 2 1991 male
8 2 1992 male
9 2 1993 male
10 2 1994 male
11 3 1990 male
12 3 1991 male
13 3 1992 male
14 3 1993 male
15 3 1994 male
16 4 1990 female
17 4 1991 female
18 4 1992 female
19 4 1993 female
20 4 1994 female
21 5 1994 male
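Filter the dataset down to ids 3 and 4, complete() their observations, and then bind the result to the rows for the other ids (those that are not 3 or 4).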
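library(dplyr)
library(tidyr)
# ids whose missing years should be filled in
complete_id <- c(3, 4)
dat %>%
  filter(id %in% complete_id) %>%
  # add a row for every id/year combination from 1990 to 1994
  complete(id, year = 1990:1994) %>%
  # carry gender down into the new rows (each id here starts with a known 1990 row)
  fill(gender) %>%
  # put the untouched ids back
  bind_rows(dat %>% filter(!id %in% complete_id)) %>%
  arrange(id)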
# id year gender
#1 1 1990 female
#2 1 1991 female
#3 1 1992 female
#4 1 1993 female
#5 1 1994 female
#6 2 1990 male
#7 2 1991 male
#8 2 1992 male
#9 2 1993 male
#10 2 1994 male
#11 3 1990 male
#12 3 1991 male
#13 3 1992 male
#14 3 1993 male
#15 3 1994 male
#16 4 1990 female
#17 4 1991 female
#18 4 1992 female
#19 4 1993 female
#20 4 1994 female
#21 5 1994 male
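If you would rather not hard-code which ids to complete, one possible variation is to group by id and let tidyr::full_seq() build the year range from each group's own minimum and maximum. This is only a sketch under a few assumptions: years are whole numbers one apart, gender is constant within an id, and the name filled is just a placeholder for this example. An id with a single observation, like id 5, keeps its one row because its minimum and maximum year are the same.

```r
library(dplyr)
library(tidyr)

# Same example data as in the question
dat <- data.frame(
  id = c(rep(1, 5), rep(2, 5), rep(3, 4), rep(4, 2), 5),
  year = c(1990:1994, 1990:1994, 1990, 1991, 1992, 1994, 1990, 1994, 1994),
  gender = c(rep("female", 5), rep("male", 5), rep("male", 4),
             rep("female", 2), "male")
)

filled <- dat %>%
  group_by(id) %>%
  # per id: expand year to every value between that id's min and max
  complete(year = full_seq(year, period = 1)) %>%
  # fill gender within each id only, so values never leak across ids
  fill(gender, .direction = "downup") %>%
  ungroup()

print(filled, n = Inf)
```

Because fill() runs on the grouped data, it does not depend on the first row of each id being present, which the ungrouped version relies on.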