Grouping a table in R by ID and sequence, without gaps


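I have a table of fictitious hospital data. When the (non-existent) people have hospital transfers, I need to replace the discharge date with the final discharge date. I first pull out the transfer rows: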

# every transfer row, plus the row just before each first transfer
rows <- sort(c(which(data$TRANSFER_NUM != 0), which(data$TRANSFER_NUM == 1)-1))
subset <- data[rows,]

Annoyingly, some people can have multiple transfers for different episodes, i.e.

ID   DISCHARGE_DATE   FILE_SEQUENCE   TRANSFER_NUM
A    1992-12-04       3360            0
A    1993-02-11       3361            1
A    1993-03-10       3362            2
A    1993-11-25       3363            3
B    1987-05-15       3419            0
B    1987-05-19       3420            1
B    1990-02-03       3473            0
B    1990-02-05       3474            1

This means that

ddply(subset, "ID", mutate, max=max(DISCHARGE_DATE))

gives the wrong result for person B, whereas the correct result should be:

ID   DISCHARGE_DATE   FILE_SEQUENCE   TRANSFER_NUM   NEW_DISCHARGE_DATE
A    1992-12-04       3360            0              1993-11-25
A    1993-02-11       3361            1              1993-11-25
A    1993-03-10       3362            2              1993-11-25
A    1993-11-25       3363            3              1993-11-25
B    1987-05-15       3419            0              1987-05-19
B    1987-05-19       3420            1              1987-05-19
B    1990-02-03       3473            0              1990-02-05
B    1990-02-05       3474            1              1990-02-05
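To make the problem concrete, the per-ID maximum can be reproduced in base R on the sample rows. This is only a sketch: the name `dat` and the use of `ave()` in place of `ddply()` are assumptions made for illustration.

```r
# Sample rows copied from the table in the question
dat <- data.frame(
  ID = c("A", "A", "A", "A", "B", "B", "B", "B"),
  DISCHARGE_DATE = as.Date(c("1992-12-04", "1993-02-11", "1993-03-10", "1993-11-25",
                             "1987-05-15", "1987-05-19", "1990-02-03", "1990-02-05")),
  FILE_SEQUENCE = c(3360, 3361, 3362, 3363, 3419, 3420, 3473, 3474),
  TRANSFER_NUM = c(0, 1, 2, 3, 0, 1, 0, 1)
)

# Latest discharge date per ID only (dates handled as numbers, then converted back)
max_by_id <- ave(as.numeric(dat$DISCHARGE_DATE), dat$ID, FUN = max)
dat$max <- as.Date(max_by_id, origin = "1970-01-01")

# B's 1987 episode picks up the 1990 date, which is not what is wanted
print(dat[dat$ID == "B", c("ID", "DISCHARGE_DATE", "max")])
```

Grouping by ID alone cannot tell B's two episodes apart, so both end up with 1990-02-05.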

I thought some additional grouping might help, for example:

ID   DISCHARGE_DATE   FILE_SEQUENCE   TRANSFER_NUM   GROUP   NEW_DISCHARGE_DATE
A    1992-12-04       3360            0              1       1993-11-25
A    1993-02-11       3361            1              1       1993-11-25
A    1993-03-10       3362            2              1       1993-11-25
A    1993-11-25       3363            3              1       1993-11-25
B    1987-05-15       3419            0              1       1987-05-19
B    1987-05-19       3420            1              1       1987-05-19
B    1990-02-03       3473            0              2       1990-02-05
B    1990-02-05       3474            1              2       1990-02-05

Any help would be greatly appreciated!

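You're right, you need an intermediate grouping column. Here is a nested ddply: the inner call numbers the episodes within each ID, and the outer call takes the maximum date within each ID/GROUP pair.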

ddply(
  ddply(df, "ID", mutate, GROUP = cumsum(c(0, diff(TRANSFER_NUM) < 0))),
  c("ID", "GROUP"),
  mutate, DISCHARGE_NEW = max(as.character(DISCHARGE_DATE))
)
#   ID DISCHARGE_DATE FILE_SEQUENCE TRANSFER_NUM GROUP DISCHARGE_NEW
# 1  A     1992-12-04          3360            0     0    1993-11-25
# 2  A     1993-02-11          3361            1     0    1993-11-25
# 3  A     1993-03-10          3362            2     0    1993-11-25
# 4  A     1993-11-25          3363            3     0    1993-11-25
# 5  B     1987-05-15          3419            0     0    1987-05-19
# 6  B     1987-05-19          3420            1     0    1987-05-19
# 7  B     1990-02-03          3473            0     1    1990-02-05
# 8  B     1990-02-05          3474            1     1    1990-02-05
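The grouping expression in the inner call can be unpacked step by step. The sketch below uses person B's TRANSFER_NUM values from the question; the variable names are chosen here for illustration only.

```r
# TRANSFER_NUM for person B, in file order
transfer_num <- c(0, 1, 0, 1)

# Change from one row to the next
step <- diff(transfer_num)

# Flag rows where the counter drops, i.e. a new episode starts;
# the leading 0 stands for the first row, which has no predecessor
restart <- c(0, step < 0)

# A running total of the flags gives an episode index
group <- cumsum(restart)

print(step)     # 1 -1 1
print(restart)  # 0 0 1 0
print(group)    # 0 0 1 1
```

Because the test is `diff(...) < 0`, two consecutive rows with TRANSFER_NUM 0 would not be split. That should not occur in the subset built in the question, where every episode contains at least one transfer row, but it is worth keeping in mind for other data.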

Try:

ddply(subset, .(ID,grp=c(0,cumsum(diff(subset$TRANSFER_NUM)-1))), mutate, max=max(DISCHARGE_DATE))

It does assume that TRANSFER_NUM is consecutive within an episode, i.e. 1:x.
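The reason for that assumption shows up when the key is computed by hand. In the sketch below, the `gappy` vector is a hypothetical example that is not in the question's data, and the `cumsum(transfer_num == 0)` variant is an alternative suggested here rather than part of the answer; it assumes every episode starts with TRANSFER_NUM equal to 0.

```r
# TRANSFER_NUM for all eight sample rows, in file order
transfer_num <- c(0, 1, 2, 3, 0, 1, 0, 1)

# Key from the answer: unchanged while the counter rises by exactly 1,
# shifted whenever the counter resets
grp <- c(0, cumsum(diff(transfer_num) - 1))
print(grp)        # 0 0 0 0 -4 -4 -6 -6

# Hypothetical episode with a gap in the numbering (0, 1, 3):
# the key changes in the middle of the episode
gappy <- c(0, 1, 3)
grp_gappy <- c(0, cumsum(diff(gappy) - 1))
print(grp_gappy)  # 0 0 1

# Alternative key, assuming each episode begins with TRANSFER_NUM == 0
episode <- cumsum(transfer_num == 0)
print(episode)    # 1 1 1 1 2 2 3 3
```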

Following up on the comments, this is the result I get:

library(plyr)  # provides ddply(), mutate() and .()
subset <- read.table(text="ID     DISCHARGE_DATE   FILE_SEQUENCE   TRANSFER_NUM
A      1992-12-04       3360            0
A      1993-02-11       3361            1
A      1993-03-10       3362            2
A      1993-11-25       3363            3
B      1987-05-15       3419            0
B      1987-05-19       3420            1
B      1990-02-03       3473            0
B      1990-02-05       3474            1", header = TRUE)
# convert to Date so that max() compares dates rather than strings or factors
subset$DISCHARGE_DATE <- as.Date(subset$DISCHARGE_DATE)
ddply(subset, .(ID, grp = c(0, cumsum(diff(subset$TRANSFER_NUM) - 1))), mutate, max = max(DISCHARGE_DATE))
  grp ID DISCHARGE_DATE FILE_SEQUENCE TRANSFER_NUM        max
1   0  A     1992-12-04          3360            0 1993-11-25
2   0  A     1993-02-11          3361            1 1993-11-25
3   0  A     1993-03-10          3362            2 1993-11-25
4   0  A     1993-11-25          3363            3 1993-11-25
5  -6  B     1990-02-03          3473            0 1990-02-05
6  -6  B     1990-02-05          3474            1 1990-02-05
7  -4  B     1987-05-15          3419            0 1987-05-19
8  -4  B     1987-05-19          3420            1 1987-05-19

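If the order of the grp values within each ID is a problem (B's 1990 episode is sorted before its 1987 episode above), just flip the sign in front of the grp definition: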

ddply(subset, .(ID,grp=-c(0,cumsum(diff(subset$TRANSFER_NUM)-1))), mutate, max=max(DISCHARGE_DATE))
  grp ID DISCHARGE_DATE FILE_SEQUENCE TRANSFER_NUM        max
1   0  A     1992-12-04          3360            0 1993-11-25
2   0  A     1993-02-11          3361            1 1993-11-25
3   0  A     1993-03-10          3362            2 1993-11-25
4   0  A     1993-11-25          3363            3 1993-11-25
5   4  B     1987-05-15          3419            0 1987-05-19
6   4  B     1987-05-19          3420            1 1987-05-19
7   6  B     1990-02-03          3473            0 1990-02-05
8   6  B     1990-02-05          3474            1 1990-02-05
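For readers without plyr, the same two-step idea (number the episodes within each ID, then take the maximum date per ID and episode) can be sketched in base R. This is an illustrative variant, not either answerer's code; the names `dat`, `key` and `latest` are assumptions.

```r
# Sample rows copied from the question
dat <- data.frame(
  ID = c("A", "A", "A", "A", "B", "B", "B", "B"),
  DISCHARGE_DATE = as.Date(c("1992-12-04", "1993-02-11", "1993-03-10", "1993-11-25",
                             "1987-05-15", "1987-05-19", "1990-02-03", "1990-02-05")),
  FILE_SEQUENCE = c(3360, 3361, 3362, 3363, 3419, 3420, 3473, 3474),
  TRANSFER_NUM = c(0, 1, 2, 3, 0, 1, 0, 1)
)

# Step 1: episode index within each ID; a drop in TRANSFER_NUM starts a new episode
dat$GROUP <- ave(dat$TRANSFER_NUM, dat$ID,
                 FUN = function(x) cumsum(c(0, diff(x) < 0)))

# Step 2: latest discharge date per ID/GROUP pair.
# A single pasted key keeps only the combinations that actually occur.
key <- paste(dat$ID, dat$GROUP)
latest <- ave(as.numeric(dat$DISCHARGE_DATE), key, FUN = max)
dat$DISCHARGE_NEW <- as.Date(latest, origin = "1970-01-01")

print(dat)
```

The rows stay in their original order here, so the question of how grp values sort within an ID does not arise.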
