数据:
DT<-data.table::data.table(
ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
C_OPR = c("ABCD01", "ABCD11", NA, "EFGH", NA, NA, "KLMN", NA),
D_OPR = c(NA, NA, "PQRST", NA, "EFGHIJ", NA, NA, NA),
DATE = c("2007-07-07","2005-05-05","2002-02-02",
"2002-02-02","2004-04-04",NA,"2001-01-01",NA),
INDX_DATE = c("2006-06-06","2006-06-06","2006-06-06",
"2001-01-01","2001-01-01","2001-01-01","2005-05-05",
"2005-05-05")
)
ALFA_DEF<-c("ABCD","EFGH")
输出:
ID C_OPR D_OPR DATE INDX_DATE
1: 1 ABCD01 <NA> 2007-07-07 2006-06-06
2: 1 ABCD11 <NA> 2005-05-05 2006-06-06
3: 1 <NA> PQRST 2002-02-02 2006-06-06
4: 2 EFGH <NA> 2002-02-02 2001-01-01
5: 2 <NA> EFGHIJ 2004-04-04 2001-01-01
6: 2 <NA> <NA> <NA> 2001-01-01
7: 3 KLMN <NA> 2001-01-01 2005-05-05
8: 3 <NA> <NA> <NA> 2005-05-05
期望输出:
ID C_OPR D_OPR DATE INDX_DATE ALFA
1: 1 ABCD01 <NA> 2007-07-07 2006-06-06 1
2: 1 ABCD11 <NA> 2005-05-05 2006-06-06 1
3: 1 <NA> PQRST 2002-02-02 2006-06-06 1
4: 2 EFGH <NA> 2002-02-02 2001-01-01 0
5: 2 <NA> EFGHIJ 2004-04-04 2001-01-01 0
6: 2 <NA> <NA> <NA> 2001-01-01 0
7: 3 KLMN <NA> 2001-01-01 2005-05-05 0
8: 3 <NA> <NA> <NA> 2005-05-05 0
逻辑:
在同一组(ID
)中,任何包含ALFA_DEF
的C_OPR
或D_OPR
,其中DATE
小于INDX_DATE
-,则生成ALFA
=1,否则为0。
尝试不更正日期:
DT[, ALPHA := +any( grepl(paste0(ALPHA_DEF, collapse="|"),c(D_OPR, C_OPR)), by=ID]
最好是data.table解决方案,但也欢迎dplyr。
/H
我们可以作为这样做
library(data.table)
pat <- paste(ALFA_DEF, collapse = "|")
DT[, ALFA := +(any(Reduce(`|`, lapply(.SD, (x)
grepl(pat, x[DATE < INDX_DATE]))))), by = ID, .SDcols = patterns("_OPR$")]
-输出
> DT
ID C_OPR D_OPR DATE INDX_DATE ALFA
1: 1 ABCD01 <NA> 2007-07-07 2006-06-06 1
2: 1 ABCD11 <NA> 2005-05-05 2006-06-06 1
3: 1 <NA> PQRST 2002-02-02 2006-06-06 1
4: 2 EFGH <NA> 2002-02-02 2001-01-01 0
5: 2 <NA> EFGHIJ 2004-04-04 2001-01-01 0
6: 2 <NA> <NA> <NA> 2001-01-01 0
7: 3 KLMN <NA> 2001-01-01 2005-05-05 0
8: 3 <NA> <NA> <NA> 2005-05-05 0
我们可以这样做:
创建新列后ALFA
我们检查a) 任何C_OPR
或D_OPR
都包含ALFA_DEF
b)DATE
<INDX_DATE
则CCD_ 14将是CCD_否则ALFA
将是0
DT[, ALFA := ifelse(any(grepl(paste0(ALFA_DEF, collapse = "|"), c(D_OPR, C_OPR)) & as.Date(DATE) < as.Date(INDX_DATE)), 1, 0), by = ID]
输出:
ID C_OPR D_OPR DATE INDX_DATE ALFA
1: 1 ABCD01 <NA> 2007-07-07 2006-06-06 1
2: 1 ABCD11 <NA> 2005-05-05 2006-06-06 1
3: 1 <NA> PQRST 2002-02-02 2006-06-06 1
4: 2 EFGH <NA> 2002-02-02 2001-01-01 0
5: 2 <NA> EFGHIJ 2004-04-04 2001-01-01 0
6: 2 <NA> <NA> <NA> 2001-01-01 0
7: 3 KLMN <NA> 2001-01-01 2005-05-05 0
8: 3 <NA> <NA> <NA> 2005-05-05 0