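I have a fictional hospital data table in which, whenever a (non-existent) person has hospital transfers, I need to replace the discharge date with the final discharge date.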
rows <- sort(c(which(data$TRANSFER_NUM != 0), which(data$TRANSFER_NUM == 1)-1))
subset <- data[rows,]
Annoyingly, some people can have multiple transfers for different events, i.e.
ID  DISCHARGE_DATE  FILE_SEQUENCE  TRANSFER_NUM
A   1992-12-04      3360           0
A   1993-02-11      3361           1
A   1993-03-10      3362           2
A   1993-11-25      3363           3
B   1987-05-15      3419           0
B   1987-05-19      3420           1
B   1990-02-03      3473           0
B   1990-02-05      3474           1
This means that
ddply(subset, "ID", mutate, max=max(DISCHARGE_DATE))
would give the wrong result for person B, whereas the correct result should be:
ID  DISCHARGE_DATE  FILE_SEQUENCE  TRANSFER_NUM  NEW_DISCHARGE_DATE
A   1992-12-04      3360           0             1993-11-25
A   1993-02-11      3361           1             1993-11-25
A   1993-03-10      3362           2             1993-11-25
A   1993-11-25      3363           3             1993-11-25
B   1987-05-15      3419           0             1987-05-19
B   1987-05-19      3420           1             1987-05-19
B   1990-02-03      3473           0             1990-02-05
B   1990-02-05      3474           1             1990-02-05
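To make the problem concrete, here is a minimal base R sketch (it uses `ave()` instead of `ddply`, and the data frame name `df` is only an illustration, not the real table) showing that a per-ID maximum assigns the 1990 date to every row of person B, including the 1987 episode:

```r
# Toy data copied from the table above; `df` is a hypothetical name.
df <- data.frame(
  ID = c("A", "A", "A", "A", "B", "B", "B", "B"),
  DISCHARGE_DATE = as.Date(c("1992-12-04", "1993-02-11", "1993-03-10",
                             "1993-11-25", "1987-05-15", "1987-05-19",
                             "1990-02-03", "1990-02-05")),
  FILE_SEQUENCE = c(3360, 3361, 3362, 3363, 3419, 3420, 3473, 3474),
  TRANSFER_NUM = c(0, 1, 2, 3, 0, 1, 0, 1)
)

# ave() returns, for every row, FUN applied to that row's group.
# Dates are converted to day counts first and back to Date afterwards.
days_max <- ave(as.numeric(df$DISCHARGE_DATE), df$ID, FUN = max)
per_id_max <- as.Date(days_max, origin = "1970-01-01")

# Every B row gets 1990-02-05, although the first two B rows
# belong to an episode that ended on 1987-05-19.
format(unique(per_id_max[df$ID == "B"]))
```

Grouping by ID alone cannot tell the two episodes of person B apart, which is why some extra grouping key seems necessary.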
I think some additional grouping might help, such as:
ID  DISCHARGE_DATE  FILE_SEQUENCE  TRANSFER_NUM  GROUP  NEW_DISCHARGE_DATE
A   1992-12-04      3360           0             1      1993-11-25
A   1993-02-11      3361           1             1      1993-11-25
A   1993-03-10      3362           2             1      1993-11-25
A   1993-11-25      3363           3             1      1993-11-25
B   1987-05-15      3419           0             1      1987-05-19
B   1987-05-19      3420           1             1      1987-05-19
B   1990-02-03      3473           0             2      1990-02-05
B   1990-02-05      3474           1             2      1990-02-05
Any help would be much appreciated!
You're right, you need an intermediate grouping column. Here is one way with a nested ddply:
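library(plyr)  # provides ddply() and mutate()
ddply(
  # Inner call: within each ID, start a new GROUP whenever TRANSFER_NUM drops
  ddply(df, "ID", mutate, GROUP=cumsum(c(0, diff(TRANSFER_NUM) < 0))),
  c("ID", "GROUP"),
  # Outer call: latest date per ID and GROUP (ISO dates sort correctly as strings)
  mutate, DISCHARGE_NEW=max(as.character(DISCHARGE_DATE))
)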
# ID DISCHARGE_DATE FILE_SEQUENCE TRANSFER_NUM GROUP DISCHARGE_NEW
# 1 A 1992-12-04 3360 0 0 1993-11-25
# 2 A 1993-02-11 3361 1 0 1993-11-25
# 3 A 1993-03-10 3362 2 0 1993-11-25
# 4 A 1993-11-25 3363 3 0 1993-11-25
# 5 B 1987-05-15 3419 0 0 1987-05-19
# 6 B 1987-05-19 3420 1 0 1987-05-19
# 7 B 1990-02-03 3473 0 1 1990-02-05
# 8 B 1990-02-05 3474 1 1 1990-02-05
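For anyone who wants to check the grouping logic without plyr, the same idea can be sketched in base R with `ave()`. This is only an illustration under the assumption that the rows are already ordered by ID and FILE_SEQUENCE; the helper column name `key` is made up for this example:

```r
# Toy data copied from the question.
df <- data.frame(
  ID = c("A", "A", "A", "A", "B", "B", "B", "B"),
  DISCHARGE_DATE = as.Date(c("1992-12-04", "1993-02-11", "1993-03-10",
                             "1993-11-25", "1987-05-15", "1987-05-19",
                             "1990-02-03", "1990-02-05")),
  FILE_SEQUENCE = c(3360, 3361, 3362, 3363, 3419, 3420, 3473, 3474),
  TRANSFER_NUM = c(0, 1, 2, 3, 0, 1, 0, 1)
)

# Episode counter per ID: goes up by one each time TRANSFER_NUM drops,
# i.e. each time a new chain of transfers starts.
df$GROUP <- ave(df$TRANSFER_NUM, df$ID,
                FUN = function(x) cumsum(c(0, diff(x) < 0)))

# Combine ID and GROUP into one key (hypothetical helper column) so that
# ave() only sees combinations that actually occur in the data.
df$key <- paste(df$ID, df$GROUP)

# Latest discharge date within each ID/GROUP episode.
days_max <- ave(as.numeric(df$DISCHARGE_DATE), df$key, FUN = max)
df$DISCHARGE_NEW <- as.Date(days_max, origin = "1970-01-01")

df[, c("ID", "DISCHARGE_DATE", "TRANSFER_NUM", "GROUP", "DISCHARGE_NEW")]
```

Here the dates stay as `Date` objects instead of being compared as character strings; both approaches give the same maximum for ISO-formatted dates.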
Try:
ddply(subset, .(ID,grp=c(0,cumsum(diff(subset$TRANSFER_NUM)-1))), mutate, max=max(DISCHARGE_DATE))
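It does assume that TRANSFER_NUM is sequential within each episode (it goes up by exactly 1 from one row to the next), i.e. 1:x
Following the comments, this is the result I get: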
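A small sketch of what that assumption means in practice. The vector below is hypothetical (one person whose first episode jumps from 0 to 2, as if a record were missing); it compares the `diff(...) - 1` key with a key that only reacts to drops in the counter:

```r
# Hypothetical TRANSFER_NUM values: episode one is 0, 2, 3 (a gap),
# episode two is 0, 1.
transfer_num <- c(0, 2, 3, 0, 1)

# Key based on diff - 1: changes whenever the step is anything other than +1.
grp_step <- c(0, cumsum(diff(transfer_num) - 1))

# Key based on drops only: changes when the counter falls back.
grp_drop <- cumsum(c(0, diff(transfer_num) < 0))

grp_step  # 0  1  1 -3 -3  -> the first row is split off from its episode
grp_drop  # 0  0  0  1  1  -> two episodes, as intended
```

So with gaps in TRANSFER_NUM the `diff - 1` key would produce too many groups, while on gap-free data like the example both keys separate the episodes in the same way.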
subset<-read.table(text="ID DISCHARGE_DATE FILE_SEQUENCE TRANSFER_NUM
A 1992-12-04 3360 0
A 1993-02-11 3361 1
A 1993-03-10 3362 2
A 1993-11-25 3363 3
B 1987-05-15 3419 0
B 1987-05-19 3420 1
B 1990-02-03 3473 0
B 1990-02-05 3474 1",header=TRUE)
subset$DISCHARGE_DATE<-as.Date(subset$DISCHARGE_DATE)
ddply(subset, .(ID,grp=c(0,cumsum(diff(subset$TRANSFER_NUM)-1))), mutate, max=max(DISCHARGE_DATE))
grp ID DISCHARGE_DATE FILE_SEQUENCE TRANSFER_NUM max
1 0 A 1992-12-04 3360 0 1993-11-25
2 0 A 1993-02-11 3361 1 1993-11-25
3 0 A 1993-03-10 3362 2 1993-11-25
4 0 A 1993-11-25 3363 3 1993-11-25
5 -6 B 1990-02-03 3473 0 1990-02-05
6 -6 B 1990-02-05 3474 1 1990-02-05
7 -4 B 1987-05-15 3419 0 1987-05-19
8 -4 B 1987-05-19 3420 1 1987-05-19
If the order of the grp values within each ID is a problem, just flip the sign in front of the grp definition:
ddply(subset, .(ID,grp=-c(0,cumsum(diff(subset$TRANSFER_NUM)-1))), mutate, max=max(DISCHARGE_DATE))
grp ID DISCHARGE_DATE FILE_SEQUENCE TRANSFER_NUM max
1 0 A 1992-12-04 3360 0 1993-11-25
2 0 A 1993-02-11 3361 1 1993-11-25
3 0 A 1993-03-10 3362 2 1993-11-25
4 0 A 1993-11-25 3363 3 1993-11-25
5 4 B 1987-05-15 3419 0 1987-05-19
6 4 B 1987-05-19 3420 1 1987-05-19
7 6 B 1990-02-03 3473 0 1990-02-05
8 6 B 1990-02-05 3474 1 1990-02-05