我有两个数据集(例如df
和df1
),我想对第一个数据集(df
)的观测值进行子集,该数据集具有第二个数据集(df1
)的观测日期范围。下面是示例数据集。
#Create variables
ID <- c(rep(c(1,2,3,4,5),times =3),rep(c(6,7,8),times=4),rep(9,times =2),10)
visit <- c(rep(1,times=5),rep(2,times=5),rep(3,times=5),rep(1, times=3),rep(2, times=3),
rep(3,times=3),rep(4,times=3),1,2,1)
StartDate <- as.Date(c("2020/01/01","2020/01/15","2020/01/29","2020/02/12","2020/02/26",
"2020/03/11","2020/04/01","2020/04/15","2020/04/29","2020/05/13",
"2020/05/27","2020/06/10","2020/06/24","2020/07/08","2020/07/22",
"2020/08/05","2020/08/19","2020/09/02","2020/09/16","2020/09/30",
"2020/10/14","2020/10/28","2020/11/11","2020/11/25","2020/12/09",
"2020/12/23","2021/01/06","2021/01/20","2021/02/03","2021/02/17"))
EndDate <- as.Date(c("2020/01/08","2020/01/22","2020/02/05","2020/02/19","2020/03/04",
"2020/03/25","2020/04/08","2020/04/22","2020/05/06","2020/05/20",
"2020/06/03","2020/06/17","2020/07/01","2020/07/15","2020/07/29",
"2020/08/12","2020/08/26","2020/09/09","2020/09/23","2020/10/07",
"2020/10/21","2020/11/04","2020/11/18","2020/12/02","2020/12/16",
"2020/12/30","2021/01/13","2021/01/27","2021/02/10","2021/02/24"))
#Create first data frame
df <- arrange(data.frame(ID,StartDate,EndDate,visit),ID)
df
ID StartDate EndDate visit
1 1 2020-01-01 2020-01-08 1
2 1 2020-03-11 2020-03-25 2
3 1 2020-05-27 2020-06-03 3
4 2 2020-01-15 2020-01-22 1
5 2 2020-04-01 2020-04-08 2
6 2 2020-06-10 2020-06-17 3
7 3 2020-01-29 2020-02-05 1
8 3 2020-04-15 2020-04-22 2
9 3 2020-06-24 2020-07-01 3
10 4 2020-02-12 2020-02-19 1
11 4 2020-04-29 2020-05-06 2
12 4 2020-07-08 2020-07-15 3
13 5 2020-02-26 2020-03-04 1
14 5 2020-05-13 2020-05-20 2
15 5 2020-07-22 2020-07-29 3
16 6 2020-08-05 2020-08-12 1
17 6 2020-09-16 2020-09-23 2
18 6 2020-10-28 2020-11-04 3
19 6 2020-12-09 2020-12-16 4
20 7 2020-08-19 2020-08-26 1
21 7 2020-09-30 2020-10-07 2
22 7 2020-11-11 2020-11-18 3
23 7 2020-12-23 2020-12-30 4
24 8 2020-09-02 2020-09-09 1
25 8 2020-10-14 2020-10-21 2
26 8 2020-11-25 2020-12-02 3
27 8 2021-01-06 2021-01-13 4
28 9 2021-01-20 2021-01-27 1
29 9 2021-02-03 2021-02-10 2
30 10 2021-02-17 2021-02-24 1
#Create second data frame
ID <- c(1,2,3,4,5,6,7,8,9,10)
visit <- rep(1,times=10)
ObservationDate<- as.Date(c("2020-03-19","2020-01-20","2020-06-25","2020-07-12",
"2020-02-26","2020-12-16","2020-10-05","2020-01-01",
"2021-01-27","2021-02-18"))
df1 <- data.frame(ID,ObservationDate,visit)
df1
ID ObservationDate visit
1 1 2020-03-19 1
2 2 2020-01-20 1
3 3 2020-06-25 1
4 4 2020-07-12 1
5 5 2020-02-26 1
6 6 2020-12-16 1
7 7 2020-10-05 1
8 8 2020-01-01 1
9 9 2021-01-27 1
10 10 2021-02-18 1
我尝试了以下方法建议从以前的问题张贴在这里我如何找到如果在第一个DF日期落在另一个数据帧的范围内?我做了以下操作,我得到了错误。
#create dummy variable indicating
library(tidyverse)
df %>%
mutate(across(c(ObservationDate, StartDate, EndDate), as.Date, format = "%m/%d/%Y")) %>%
mutate(dummy = ifelse(df1$ObservationDate >= StartDate& date <= EndDate, 1, NA))
我想做的是创建一个虚拟变量,当df1
的ObservationDate
落在df
的StartDate
和EndDate
区间内时,将1分配给观测值。我期望得到如下的输出:
ID StartDate EndDate visit dummy
1 1 2020-01-01 2020-01-08 1 NA
2 1 2020-03-11 2020-03-25 2 1
3 1 2020-05-27 2020-06-03 3 NA
4 2 2020-01-15 2020-01-22 1 1
5 2 2020-04-01 2020-04-08 2 NA
6 2 2020-06-10 2020-06-17 3 NA
7 3 2020-01-29 2020-02-05 1 NA
8 3 2020-04-15 2020-04-22 2 NA
9 3 2020-06-24 2020-07-01 3 1
10 4 2020-02-12 2020-02-19 1 NA
11 4 2020-04-29 2020-05-06 2 NA
12 4 2020-07-08 2020-07-15 3 1
13 5 2020-02-26 2020-03-04 1 1
14 5 2020-05-13 2020-05-20 2 NA
15 5 2020-07-22 2020-07-29 3 NA
16 6 2020-08-05 2020-08-12 1 NA
17 6 2020-09-16 2020-09-23 2 NA
18 6 2020-10-28 2020-11-04 3 NA
19 6 2020-12-09 2020-12-16 4 1
20 7 2020-08-19 2020-08-26 1 NA
21 7 2020-09-30 2020-10-07 2 1
22 7 2020-11-11 2020-11-18 3 NA
23 7 2020-12-23 2020-12-30 4 NA
24 8 2020-09-02 2020-09-09 1 NA
25 8 2020-10-14 2020-10-21 2 NA
26 8 2020-11-25 2020-12-02 3 NA
27 8 2021-01-06 2021-01-13 4 NA
28 9 2021-01-20 2021-01-27 1 1
29 9 2021-02-03 2021-02-10 2 NA
30 10 2021-02-17 2021-02-24 1 1
从这个输出中,我想使用代码subset(df,df$dummy==1)
来选择所需的数据。所需的总体输出如下所示。
ID StartDate EndDate visit dummy
1 1 2020-03-11 2020-03-25 2 1
2 2 2020-01-15 2020-01-22 1 1
3 3 2020-06-24 2020-07-01 3 1
4 4 2020-07-08 2020-07-15 3 1
5 5 2020-02-26 2020-03-04 1 1
6 6 2020-12-09 2020-12-16 4 1
7 7 2020-09-30 2020-10-07 2 1
8 9 2021-01-20 2021-01-27 1 1
9 10 2021-02-17 2021-02-24 1 1
我想在包含超过20,000个观测值的数据集上执行此操作。所以我正在寻求帮助,我如何才能达到上述结果,最好使用更短的方法。
提前谢谢你。
我想到了一个办法。我正在分享我的解决方案,以帮助其他可能面临类似问题的人。我按照以下步骤解决了这个问题。
步骤1:使用merge()
函数合并df
和df1
。
df2 <- merge(df,df1,by="ID")
df2
ID StartDate EndDate visit.x ObservationDate visit.y
1 1 2020-01-01 2020-01-08 1 2020-03-19 1
2 1 2020-03-11 2020-03-25 2 2020-03-19 1
3 1 2020-05-27 2020-06-03 3 2020-03-19 1
4 2 2020-01-15 2020-01-22 1 2020-01-20 1
5 2 2020-04-01 2020-04-08 2 2020-01-20 1
6 2 2020-06-10 2020-06-17 3 2020-01-20 1
7 3 2020-01-29 2020-02-05 1 2020-06-25 1
8 3 2020-04-15 2020-04-22 2 2020-06-25 1
9 3 2020-06-24 2020-07-01 3 2020-06-25 1
10 4 2020-02-12 2020-02-19 1 2020-07-12 1
11 4 2020-04-29 2020-05-06 2 2020-07-12 1
12 4 2020-07-08 2020-07-15 3 2020-07-12 1
步骤2:使用ifelse()
函数创建一个虚拟变量。
df2$dummy <- ifelse(df2$ObservationDate>=df2$StartDate & df2$ObservationDate<= df2$EndDate,1,0)
df2
ID StartDate EndDate visit.x ObservationDate visit.y dummy
1 1 2020-01-01 2020-01-08 1 2020-03-19 1 0
2 1 2020-03-11 2020-03-25 2 2020-03-19 1 1
3 1 2020-05-27 2020-06-03 3 2020-03-19 1 0
4 2 2020-01-15 2020-01-22 1 2020-01-20 1 1
5 2 2020-04-01 2020-04-08 2 2020-01-20 1 0
6 2 2020-06-10 2020-06-17 3 2020-01-20 1 0
7 3 2020-01-29 2020-02-05 1 2020-06-25 1 0
8 3 2020-04-15 2020-04-22 2 2020-06-25 1 0
9 3 2020-06-24 2020-07-01 3 2020-06-25 1 1
10 4 2020-02-12 2020-02-19 1 2020-07-12 1 0
步骤3:使用subset()
函数选择所需数据
subset(df2,df2$dummy==1)
ID StartDate EndDate visit.x ObservationDate visit.y dummy
2 1 2020-03-11 2020-03-25 2 2020-03-19 1 1
4 2 2020-01-15 2020-01-22 1 2020-01-20 1 1
9 3 2020-06-24 2020-07-01 3 2020-06-25 1 1
12 4 2020-07-08 2020-07-15 3 2020-07-12 1 1
13 5 2020-02-26 2020-03-04 1 2020-02-26 1 1
19 6 2020-12-09 2020-12-16 4 2020-12-16 1 1
21 7 2020-09-30 2020-10-07 2 2020-10-05 1 1
28 9 2021-01-20 2021-01-27 1 2021-01-27 1 1
30 10 2021-02-17 2021-02-24 1 2021-02-18 1 1