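I have two large datasets whose only shared feature is a numeric timestamp. I want to merge the data frames by this timestamp, but the data were not collected at exactly matching frequencies, so I need to allow merging on the closest match.
As a simplified example, here is a small dataset with a value column, some events, and an ID: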
library(data.table)
a<-c("150", "164", "175", "183", "195", "200", "205","213")
b<-c("start1","end1","start2", "end2", "start1", "end1", "start2", "end2")
c<-c("A","A","A", "A", "B", "B", "B", "B")
(data<-data.table(value = a, event = b, ID = c))
I would like to be able to merge this "data" with this numeric series ("times") by the value column:
(times<-data.frame(value = c(seq(from = 150, to = 213, by = 3))))
so that they are merged on the closest approximate match in the value column, producing this final data frame:
agoal<-c(seq(from = 150, to = 213, by = 3))
bgoal<-c("start1","","","","","end1","", "",
"start2", "", "", "end2", "", "", "",
"start1", "", "end1", "start2", "", "", "end2")
cgoal<-c("A","","","","","A","", "",
"A", "", "", "A", "", "", "",
"B", "", "B", "B", "", "", "B")
(goal<-data.frame(value = agoal, event = bgoal, ID = cgoal))
Is there a way to do this, especially for a very large dataset (so that it doesn't crash R)?
data.table provides a rolling join solution:
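library(data.table)
# data and times are the data.tables defined under "Data" below
setkey(data, value)
setkey(times, value)
# for each row of times, take the row of data whose value is nearest
data[times, roll = "nearest"]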
#    value  event ID
# 1:   150 start1  A
# 2:   153 start1  A
# 3:   156 start1  A
# 4:   159   end1  A
# 5:   162   end1  A
# 6:   165   end1  A
# 7:   168   end1  A
# 8:   171 start2  A
# 9:   174 start2  A
#10:   177 start2  A
#11:   180   end2  A
#12:   183   end2  A
#13:   186   end2  A
#14:   189   end2  A
#15:   192 start1  B
#16:   195 start1  B
#17:   198   end1  B
#18:   201   end1  B
#19:   204 start2  B
#20:   207 start2  B
#21:   210   end2  B
#22:   213   end2  B
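The rolling join above fills every row of times with its nearest neighbour, whereas the goal in the question leaves unmatched rows blank. One possible way to get there while staying in data.table is sketched below: keep a copy of the original timestamp, do the nearest join, then blank out rows whose match is further away than a tolerance. The column name matched_value and the tolerance tol = 1 are assumptions introduced for illustration, not part of the original answer.

```r
library(data.table)

data <- data.table(
  value = c(150, 164, 175, 183, 195, 200, 205, 213),
  event = c("start1", "end1", "start2", "end2", "start1", "end1", "start2", "end2"),
  ID    = c("A", "A", "A", "A", "B", "B", "B", "B")
)
times <- data.table(value = seq(from = 150, to = 213, by = 3))

# Keep the original timestamp: after the join, `value` holds the values from `times`.
data[, matched_value := value]

# Nearest-match rolling join, as in the answer above.
res <- data[times, on = "value", roll = "nearest"]

# Hypothetical tolerance: matches further away than this count as "no match".
tol <- 1
res[abs(value - matched_value) > tol, c("event", "ID") := list("", "")]

# Drop the helper column so the result has the same columns as the goal.
res[, matched_value := NULL]
res
```

The tolerance is a judgment call: with a 3-unit grid in times, any tolerance below 1.5 guarantees that each row of data can match at most one row of times.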
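Data (note that value is numeric here and times is a data.table, so that both can be keyed and joined):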
a<-c("150", "164", "175", "183", "195", "200", "205","213")
b<-c("start1","end1","start2", "end2", "start1", "end1", "start2", "end2")
c<-c("A","A","A", "A", "B", "B", "B", "B")
data<-data.table(value = as.numeric(a), event = b, ID = c)
times<-data.table(value = c(seq(from = 150, to = 213, by = 3)))
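If the objects already exist as created in the question (value stored as character, times as a data.frame), they can probably be converted in place rather than rebuilt, which matters for very large data. A minimal sketch, assuming the same object names as the question:

```r
library(data.table)

# Objects as created in the question: character timestamps, times as a data.frame.
data <- data.table(
  value = c("150", "164", "175", "183", "195", "200", "205", "213"),
  event = c("start1", "end1", "start2", "end2", "start1", "end1", "start2", "end2"),
  ID    = c("A", "A", "A", "A", "B", "B", "B", "B")
)
times <- data.frame(value = seq(from = 150, to = 213, by = 3))

# Convert the data.frame to a data.table by reference (no copy of the data).
setDT(times)

# Convert the character timestamps to numeric in place.
data[, value := as.numeric(value)]

# Both tables can now be keyed on value for the rolling join.
setkey(data, value)
setkey(times, value)
```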
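To join on the closest match without filling the gaps with approximate matches, fuzzyjoin works well! With max_dist = 1, a row of times is matched only if a value in data lies within 1 of it; all other rows are left as NA.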
(end<-fuzzyjoin::difference_left_join(times, data, by = "value", max_dist = 1, distance_col= "distance"))
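The joined result keeps both timestamp columns (suffixed .x and .y) plus the distance column, and unmatched rows hold NA rather than empty strings. As a rough sketch of how it might be tidied into the shape of goal from the question (the names joined and result are introduced here for illustration):

```r
library(fuzzyjoin)

times <- data.frame(value = seq(from = 150, to = 213, by = 3))
data <- data.frame(
  value = c(150, 164, 175, 183, 195, 200, 205, 213),
  event = c("start1", "end1", "start2", "end2", "start1", "end1", "start2", "end2"),
  ID    = c("A", "A", "A", "A", "B", "B", "B", "B"),
  stringsAsFactors = FALSE
)

# Left join: keep every row of times, match data rows within max_dist of it.
joined <- difference_left_join(times, data, by = "value",
                               max_dist = 1, distance_col = "distance")

# Make sure rows are in timestamp order before reshaping.
joined <- joined[order(joined$value.x), ]

# Keep the timestamp from times and replace NA (no match) with "".
result <- data.frame(
  value = joined$value.x,
  event = ifelse(is.na(joined$event), "", joined$event),
  ID    = ifelse(is.na(joined$ID), "", joined$ID),
  stringsAsFactors = FALSE
)
result
```

One caveat with this approach: if max_dist is large relative to the spacing of either series, a single row can match several partners and the result will contain duplicated rows.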