我有两组数据:
第一组:
patient<-c("A","A","B","B","C","C","C","C")
arrival<-c("11:00","11:00","13:00","13:00","14:00","14:00","14:00","14:00")
lastRow<-c("","Yes","","Yes","","","","Yes")
data1<-data.frame(patient,arrival,lastRow)
另一组数据:
patient<-c("A","A","A","A","B","B","B","C","C","C")
availableSlot<-c("11:15","11:35","11:45","11:55","12:55","13:55","14:00","14:00","14:10","17:00")
data2<-data.frame(patient, availableSlot)
我想创建向第一个数据集添加一列,以便对于每个患者的最后一行,它显示可用的插槽最接近到达时间:
结果将是:
patient arrival lastRow availableSlot
A 11:00
A 11:00 Yes 11:15
B 13:00
B 13:00 Yes 12:55
C 14:00
C 14:00
C 14:00
C 14:00 Yes 14:00
如果有人能告诉我如何在 R 中实现这一点,将不胜感激。
我会使用 data.table,首先通过转换为 ITime 并忽略冗余行来清理:
library(data.table)
setDT(data1)[, arrival := as.ITime(as.character(arrival))]
setDT(data2)[, availableSlot := as.ITime(as.character(availableSlot))]
DT1 = unique(data1, by="patient", fromLast=TRUE)
然后,您可以执行"滚动连接":
res = data2[DT1, on=.(patient, availableSlot = arrival), roll="nearest",
.(patient, availableSlot = x.availableSlot)]
# patient availableSlot
# 1: A 11:15:00
# 2: B 12:55:00
# 3: C 14:00:00
工作原理
语法x[i, on=, roll=, j]
。
-
on=
是合并依据列。 - 这是一个连接:对于每一行
i
,我们正在寻找x
中的匹配项。 - 使用
roll="nearest"
,on=
中的最后一列被"滚动"到最接近的匹配。 - 原始表中的
on=
列可以使用x.*
和i.*
前缀进行引用。 j
参数应提供列列表,.()
是此处list()
的别名。
在 http://r-datatable.com/Getting-started 查看包的介绍性材料,并键入与滚动联接相关的文档?data.table
。
我会停在res
,但如果你真的想把它放回原来的桌子上......
# a very nonstandard step:
data1[lastRow == "Yes", availableSlot := res$availableSlot ]
# patient arrival lastRow availableSlot
# 1: A 11:00:00 <NA>
# 2: A 11:00:00 Yes 11:15:00
# 3: B 13:00:00 <NA>
# 4: B 13:00:00 Yes 12:55:00
# 5: C 14:00:00 <NA>
# 6: C 14:00:00 <NA>
# 7: C 14:00:00 <NA>
# 8: C 14:00:00 Yes 14:00:00
现在,data1
在新列中availableSlot
,类似于您执行data1$col <- val
时。
这是一个解决方案(基于 joel.wilson 对我问题的回答),适用于基本R
#Convert dates to POSIXct format
data1$arrival = as.POSIXct(data1$arrival, format = "%H:%M")
data2$availableSlot = as.POSIXct(data2$availableSlot, format = "%H:%M")
#Lookup times from data2$availableSlot closest to data1$arrival
data1$availableSlot = sapply(data1$arrival, function(x)
data2$availableSlot[which.min(abs(x - data2$availableSlot))])
#Keep just hour and minutes
data1$availableSlot = strftime(as.POSIXct(data1$availableSlot,
origin = "1970-01-01"), format = "%H:%M")
data1$arrival = strftime(as.POSIXct(data1$arrival), format = "%H:%M")
#Remove times when lastrow is empty
data1$availableSlot[which(data1$lastRow != "Yes")] = ""