如何对第一个数据框中的观测值进行子集化,其开始日期和结束日期间隔(第二个数据框的观测日期落在 R 中)?



我有两个数据集(例如dfdf1),我想对第一个数据集(df)的观测值进行子集,该数据集具有第二个数据集(df1)的观测日期范围。下面是示例数据集。

#Create variables
ID <- c(rep(c(1,2,3,4,5),times =3),rep(c(6,7,8),times=4),rep(9,times =2),10)
visit <- c(rep(1,times=5),rep(2,times=5),rep(3,times=5),rep(1, times=3),rep(2, times=3),
rep(3,times=3),rep(4,times=3),1,2,1)
StartDate <- as.Date(c("2020/01/01","2020/01/15","2020/01/29","2020/02/12","2020/02/26",
"2020/03/11","2020/04/01","2020/04/15","2020/04/29","2020/05/13",
"2020/05/27","2020/06/10","2020/06/24","2020/07/08","2020/07/22",
"2020/08/05","2020/08/19","2020/09/02","2020/09/16","2020/09/30",
"2020/10/14","2020/10/28","2020/11/11","2020/11/25","2020/12/09",
"2020/12/23","2021/01/06","2021/01/20","2021/02/03","2021/02/17"))
EndDate <- as.Date(c("2020/01/08","2020/01/22","2020/02/05","2020/02/19","2020/03/04",
"2020/03/25","2020/04/08","2020/04/22","2020/05/06","2020/05/20",
"2020/06/03","2020/06/17","2020/07/01","2020/07/15","2020/07/29",
"2020/08/12","2020/08/26","2020/09/09","2020/09/23","2020/10/07",
"2020/10/21","2020/11/04","2020/11/18","2020/12/02","2020/12/16",
"2020/12/30","2021/01/13","2021/01/27","2021/02/10","2021/02/24"))
#Create first data frame 
df <- arrange(data.frame(ID,StartDate,EndDate,visit),ID)
df

ID  StartDate    EndDate visit
1   1 2020-01-01 2020-01-08     1
2   1 2020-03-11 2020-03-25     2
3   1 2020-05-27 2020-06-03     3
4   2 2020-01-15 2020-01-22     1
5   2 2020-04-01 2020-04-08     2
6   2 2020-06-10 2020-06-17     3
7   3 2020-01-29 2020-02-05     1
8   3 2020-04-15 2020-04-22     2
9   3 2020-06-24 2020-07-01     3
10  4 2020-02-12 2020-02-19     1
11  4 2020-04-29 2020-05-06     2
12  4 2020-07-08 2020-07-15     3
13  5 2020-02-26 2020-03-04     1
14  5 2020-05-13 2020-05-20     2
15  5 2020-07-22 2020-07-29     3
16  6 2020-08-05 2020-08-12     1
17  6 2020-09-16 2020-09-23     2
18  6 2020-10-28 2020-11-04     3
19  6 2020-12-09 2020-12-16     4
20  7 2020-08-19 2020-08-26     1
21  7 2020-09-30 2020-10-07     2
22  7 2020-11-11 2020-11-18     3
23  7 2020-12-23 2020-12-30     4
24  8 2020-09-02 2020-09-09     1
25  8 2020-10-14 2020-10-21     2
26  8 2020-11-25 2020-12-02     3
27  8 2021-01-06 2021-01-13     4
28  9 2021-01-20 2021-01-27     1
29  9 2021-02-03 2021-02-10     2
30 10 2021-02-17 2021-02-24     1 


#Create second data frame
ID <- c(1,2,3,4,5,6,7,8,9,10)
visit <- rep(1,times=10)
ObservationDate<- as.Date(c("2020-03-19","2020-01-20","2020-06-25","2020-07-12",
"2020-02-26","2020-12-16","2020-10-05","2020-01-01",
"2021-01-27","2021-02-18"))
df1 <- data.frame(ID,ObservationDate,visit)
df1

ID ObservationDate visit
1   1      2020-03-19     1
2   2      2020-01-20     1
3   3      2020-06-25     1
4   4      2020-07-12     1
5   5      2020-02-26     1
6   6      2020-12-16     1
7   7      2020-10-05     1
8   8      2020-01-01     1
9   9      2021-01-27     1
10 10      2021-02-18     1

我尝试了以下方法建议从以前的问题张贴在这里我如何找到如果在第一个DF日期落在另一个数据帧的范围内?我做了以下操作,我得到了错误。

#create dummy variable indicating 
library(tidyverse)
df %>% 
mutate(across(c(ObservationDate, StartDate, EndDate), as.Date, format = "%m/%d/%Y")) %>% 
mutate(dummy = ifelse(df1$ObservationDate >=  StartDate& date <= EndDate, 1, NA))

我想做的是创建一个虚拟变量,当df1ObservationDate落在dfStartDateEndDate区间内时,将1分配给观测值。我期望得到如下的输出:

ID  StartDate    EndDate visit dummy
1   1 2020-01-01 2020-01-08     1    NA
2   1 2020-03-11 2020-03-25     2     1
3   1 2020-05-27 2020-06-03     3    NA
4   2 2020-01-15 2020-01-22     1     1
5   2 2020-04-01 2020-04-08     2    NA
6   2 2020-06-10 2020-06-17     3    NA
7   3 2020-01-29 2020-02-05     1    NA
8   3 2020-04-15 2020-04-22     2    NA
9   3 2020-06-24 2020-07-01     3     1
10  4 2020-02-12 2020-02-19     1    NA
11  4 2020-04-29 2020-05-06     2    NA
12  4 2020-07-08 2020-07-15     3     1
13  5 2020-02-26 2020-03-04     1     1
14  5 2020-05-13 2020-05-20     2    NA
15  5 2020-07-22 2020-07-29     3    NA
16  6 2020-08-05 2020-08-12     1    NA
17  6 2020-09-16 2020-09-23     2    NA
18  6 2020-10-28 2020-11-04     3    NA
19  6 2020-12-09 2020-12-16     4     1
20  7 2020-08-19 2020-08-26     1    NA
21  7 2020-09-30 2020-10-07     2     1
22  7 2020-11-11 2020-11-18     3    NA
23  7 2020-12-23 2020-12-30     4    NA
24  8 2020-09-02 2020-09-09     1    NA
25  8 2020-10-14 2020-10-21     2    NA
26  8 2020-11-25 2020-12-02     3    NA
27  8 2021-01-06 2021-01-13     4    NA
28  9 2021-01-20 2021-01-27     1     1
29  9 2021-02-03 2021-02-10     2    NA
30 10 2021-02-17 2021-02-24     1     1

从这个输出中,我想使用代码subset(df,df$dummy==1)来选择所需的数据。所需的总体输出如下所示。

ID  StartDate    EndDate visit dummy
1  1 2020-03-11 2020-03-25     2     1
2  2 2020-01-15 2020-01-22     1     1
3  3 2020-06-24 2020-07-01     3     1
4  4 2020-07-08 2020-07-15     3     1
5  5 2020-02-26 2020-03-04     1     1
6  6 2020-12-09 2020-12-16     4     1
7  7 2020-09-30 2020-10-07     2     1
8  9 2021-01-20 2021-01-27     1     1
9 10 2021-02-17 2021-02-24     1     1

我想在包含超过20,000个观测值的数据集上执行此操作。所以我正在寻求帮助,我如何才能达到上述结果,最好使用更短的方法。

提前谢谢你。

我想到了一个办法。我正在分享我的解决方案,以帮助其他可能面临类似问题的人。我按照以下步骤解决了这个问题。

步骤1:使用merge()函数合并dfdf1

df2 <- merge(df,df1,by="ID")
df2
ID  StartDate    EndDate visit.x ObservationDate visit.y
1   1 2020-01-01 2020-01-08       1      2020-03-19       1
2   1 2020-03-11 2020-03-25       2      2020-03-19       1
3   1 2020-05-27 2020-06-03       3      2020-03-19       1
4   2 2020-01-15 2020-01-22       1      2020-01-20       1
5   2 2020-04-01 2020-04-08       2      2020-01-20       1
6   2 2020-06-10 2020-06-17       3      2020-01-20       1
7   3 2020-01-29 2020-02-05       1      2020-06-25       1
8   3 2020-04-15 2020-04-22       2      2020-06-25       1
9   3 2020-06-24 2020-07-01       3      2020-06-25       1
10  4 2020-02-12 2020-02-19       1      2020-07-12       1
11  4 2020-04-29 2020-05-06       2      2020-07-12       1
12  4 2020-07-08 2020-07-15       3      2020-07-12       1

步骤2:使用ifelse()函数创建一个虚拟变量。

df2$dummy <- ifelse(df2$ObservationDate>=df2$StartDate & df2$ObservationDate<= df2$EndDate,1,0)
df2
ID  StartDate    EndDate visit.x ObservationDate visit.y dummy
1   1 2020-01-01 2020-01-08       1      2020-03-19       1     0
2   1 2020-03-11 2020-03-25       2      2020-03-19       1     1
3   1 2020-05-27 2020-06-03       3      2020-03-19       1     0
4   2 2020-01-15 2020-01-22       1      2020-01-20       1     1
5   2 2020-04-01 2020-04-08       2      2020-01-20       1     0
6   2 2020-06-10 2020-06-17       3      2020-01-20       1     0
7   3 2020-01-29 2020-02-05       1      2020-06-25       1     0
8   3 2020-04-15 2020-04-22       2      2020-06-25       1     0
9   3 2020-06-24 2020-07-01       3      2020-06-25       1     1
10  4 2020-02-12 2020-02-19       1      2020-07-12       1     0

步骤3:使用subset()函数选择所需数据

subset(df2,df2$dummy==1)
ID  StartDate    EndDate visit.x ObservationDate visit.y dummy
2   1 2020-03-11 2020-03-25       2      2020-03-19       1     1
4   2 2020-01-15 2020-01-22       1      2020-01-20       1     1
9   3 2020-06-24 2020-07-01       3      2020-06-25       1     1
12  4 2020-07-08 2020-07-15       3      2020-07-12       1     1
13  5 2020-02-26 2020-03-04       1      2020-02-26       1     1
19  6 2020-12-09 2020-12-16       4      2020-12-16       1     1
21  7 2020-09-30 2020-10-07       2      2020-10-05       1     1
28  9 2021-01-20 2021-01-27       1      2021-01-27       1     1
30 10 2021-02-17 2021-02-24       1      2021-02-18       1     1

最新更新