如何在R中同步具有部分不同时间戳的数据



我将来自同一时间事件的不同测量的数据存储在两个数据帧中。不过,两个数据帧中的时间戳并非100%相同。我想通过尽可能同步来合并这两个数据帧。

这是df1(下面的可复制数据(:

df1
rec_time   gsr
1  00:00:01.902 2.513
2  00:00:01.905 2.513
3  00:00:01.907 2.518
4  00:00:01.909 2.524
5  00:00:01.913 2.532
6  00:00:01.916 2.499
7  00:00:01.918 2.513
8  00:00:01.922 2.500
9  00:00:01.927 2.513
10 00:00:01.930 2.513
11 00:00:01.989 2.513
12 00:00:01.994 2.513
13 00:00:01.996 2.513
14 00:00:01.999 2.513
15 00:00:02.004 2.513

这是df2(下面的可复制数据(:

df2
Start_Time AOI
1 00:00:01.905 Iso
2 00:00:01.914 Kat
3 00:00:01.925 Iso
4 00:00:01.930 Iso
5 00:00:01.991 Kat

可以看出,有时时间戳(df1中的rec_timedf2中的Start_Time(显示相同的时间;尽管存在差异,但更多时候。现在,对于合并,如果存在差异,我想将df2中的行放在df1时间差最小的行上。如何实现这种同步?

预期结果如下:

df3
rec_time   gsr   Start_Time  AOI
1  00:00:01.902 2.513         <NA> <NA>
2  00:00:01.905 2.513 00:00:01.905  Iso   # same time
3  00:00:01.907 2.518         <NA> <NA>
4  00:00:01.909 2.524         <NA> <NA>
5  00:00:01.913 2.532 00:00:01.914  Kat   # different time
6  00:00:01.916 2.499         <NA> <NA>
7  00:00:01.918 2.513         <NA> <NA>
8  00:00:01.922 2.500         <NA> <NA>
9  00:00:01.927 2.513 00:00:01.925  Iso   # different time
10 00:00:01.930 2.513 00:00:01.930  Iso   # same time
11 00:00:01.989 2.513         <NA> <NA>
12 00:00:01.994 2.513 00:00:01.991  Kat   # different time
13 00:00:01.996 2.513         <NA> <NA>
14 00:00:01.999 2.513         <NA> <NA>
15 00:00:02.004 2.513         <NA> <NA>

到目前为止,我所尝试的是微不足道的:

merge(df1, df2, by.x = "rec_time", by.y = "Start_Time", all.x = T)

这显然只捕获那些具有相同时间戳的行。

可再现数据

df1 <- dput(gsr)
structure(list(rec_time = structure(1:15, .Label = c("00:00:01.902", 
"00:00:01.905", "00:00:01.907", "00:00:01.909", "00:00:01.913", 
"00:00:01.916", "00:00:01.918", "00:00:01.922", "00:00:01.927", 
"00:00:01.930", "00:00:01.989", "00:00:01.994", "00:00:01.996", 
"00:00:01.999", "00:00:02.004"), class = "factor"), gsr = c(2.513, 
2.513, 2.518, 2.524, 2.532, 2.499, 2.513, 2.5, 2.513, 2.513, 
2.513, 2.513, 2.513, 2.513, 2.513)), class = "data.frame", row.names = c(NA, 
-15L))
df2 <- dput(et)
structure(list(Start_Time = structure(1:5, .Label = c("00:00:01.905", 
"00:00:01.914", "00:00:01.925", "00:00:01.930", "00:00:01.991"
), class = "factor"), AOI = structure(c(1L, 2L, 1L, 1L, 2L), .Label = c("Iso", 
"Kat"), class = "factor")), class = "data.frame", row.names = c(NA, 
-5L))

我认为如果我们首先将时间戳转换为毫秒,那么处理数据会更容易:

library(dplyr)
timestamp_to_ms <- function(t) {
sapply(strsplit(as.character(t), ":"), function(x) {
1000 * sum(c(3600, 60, 1) * as.numeric(x))
})
}
df1$ms_time <- timestamp_to_ms(df1$rec_time)
df2$ms_time <- timestamp_to_ms(df2$Start_Time)

既然我们在两个数据帧中都有以毫秒为单位的时间,那么很容易找到最接近的匹配时间戳,并将其记录在df2:中

df2$ms_time <- sapply(df2$ms_time, function(x) {
df1$ms_time[which.min(abs(x - df1$ms_time))]
})

这样就可以实现标准left_join:

df3 <- left_join(df1, df2, by = "ms_time") %>% 
select(-ms_time)
df3
#>        rec_time   gsr   Start_Time  AOI
#> 1  00:00:01.902 2.513         <NA> <NA>
#> 2  00:00:01.905 2.513 00:00:01.905  Iso
#> 3  00:00:01.907 2.518         <NA> <NA>
#> 4  00:00:01.909 2.524         <NA> <NA>
#> 5  00:00:01.913 2.532 00:00:01.914  Kat
#> 6  00:00:01.916 2.499         <NA> <NA>
#> 7  00:00:01.918 2.513         <NA> <NA>
#> 8  00:00:01.922 2.500         <NA> <NA>
#> 9  00:00:01.927 2.513 00:00:01.925  Iso
#> 10 00:00:01.930 2.513 00:00:01.930  Iso
#> 11 00:00:01.989 2.513 00:00:01.991  Kat
#> 12 00:00:01.994 2.513         <NA> <NA>
#> 13 00:00:01.996 2.513         <NA> <NA>
#> 14 00:00:01.999 2.513         <NA> <NA>
#> 15 00:00:02.004 2.513         <NA> <NA>

由reprex包(v0.3.0(创建于2020-09-17

使用@Allan Cameron的建议,可以尝试library(fuzzyjoin)

不同的max_dist,可以得到不同精度的

difference_left_join(x = df1,
y = df2,
by = c("ms_time"),
max_dist = 2)
rec_time   gsr ms_time.x   Start_Time  AOI ms_time.y
1  00:00:01.902 2.513      1902         <NA> <NA>        NA
2  00:00:01.905 2.513      1905 00:00:01.905  Iso      1905
3  00:00:01.907 2.518      1907 00:00:01.905  Iso      1905
4  00:00:01.909 2.524      1909         <NA> <NA>        NA
5  00:00:01.913 2.532      1913 00:00:01.914  Kat      1914
6  00:00:01.916 2.499      1916 00:00:01.914  Kat      1914
7  00:00:01.918 2.513      1918         <NA> <NA>        NA
8  00:00:01.922 2.500      1922         <NA> <NA>        NA
9  00:00:01.927 2.513      1927 00:00:01.925  Iso      1925
10 00:00:01.930 2.513      1930 00:00:01.930  Iso      1930
11 00:00:01.989 2.513      1989 00:00:01.991  Kat      1991
12 00:00:01.994 2.513      1994         <NA> <NA>        NA
13 00:00:01.996 2.513      1996         <NA> <NA>        NA
14 00:00:01.999 2.513      1999         <NA> <NA>        NA
15 00:00:02.004 2.513      2004         <NA> <NA>        NA

最新更新