我将来自同一时间事件的不同测量的数据存储在两个数据帧中。不过,两个数据帧中的时间戳并非100%相同。我想通过尽可能同步来合并这两个数据帧。
这是df1
(下面的可复制数据(:
df1
rec_time gsr
1 00:00:01.902 2.513
2 00:00:01.905 2.513
3 00:00:01.907 2.518
4 00:00:01.909 2.524
5 00:00:01.913 2.532
6 00:00:01.916 2.499
7 00:00:01.918 2.513
8 00:00:01.922 2.500
9 00:00:01.927 2.513
10 00:00:01.930 2.513
11 00:00:01.989 2.513
12 00:00:01.994 2.513
13 00:00:01.996 2.513
14 00:00:01.999 2.513
15 00:00:02.004 2.513
这是df2
(下面的可复制数据(:
df2
Start_Time AOI
1 00:00:01.905 Iso
2 00:00:01.914 Kat
3 00:00:01.925 Iso
4 00:00:01.930 Iso
5 00:00:01.991 Kat
可以看出,有时时间戳(df1
中的rec_time
和df2
中的Start_Time
(显示相同的时间;尽管存在差异,但更多时候。现在,对于合并,如果存在差异,我想将df2
中的行放在df1
中时间差最小的行上。如何实现这种同步?
预期结果如下:
df3
rec_time gsr Start_Time AOI
1 00:00:01.902 2.513 <NA> <NA>
2 00:00:01.905 2.513 00:00:01.905 Iso # same time
3 00:00:01.907 2.518 <NA> <NA>
4 00:00:01.909 2.524 <NA> <NA>
5 00:00:01.913 2.532 00:00:01.914 Kat # different time
6 00:00:01.916 2.499 <NA> <NA>
7 00:00:01.918 2.513 <NA> <NA>
8 00:00:01.922 2.500 <NA> <NA>
9 00:00:01.927 2.513 00:00:01.925 Iso # different time
10 00:00:01.930 2.513 00:00:01.930 Iso # same time
11 00:00:01.989 2.513 <NA> <NA>
12 00:00:01.994 2.513 00:00:01.991 Kat # different time
13 00:00:01.996 2.513 <NA> <NA>
14 00:00:01.999 2.513 <NA> <NA>
15 00:00:02.004 2.513 <NA> <NA>
到目前为止,我所尝试的是微不足道的:
merge(df1, df2, by.x = "rec_time", by.y = "Start_Time", all.x = T)
这显然只捕获那些具有相同时间戳的行。
可再现数据:
df1 <- dput(gsr)
structure(list(rec_time = structure(1:15, .Label = c("00:00:01.902",
"00:00:01.905", "00:00:01.907", "00:00:01.909", "00:00:01.913",
"00:00:01.916", "00:00:01.918", "00:00:01.922", "00:00:01.927",
"00:00:01.930", "00:00:01.989", "00:00:01.994", "00:00:01.996",
"00:00:01.999", "00:00:02.004"), class = "factor"), gsr = c(2.513,
2.513, 2.518, 2.524, 2.532, 2.499, 2.513, 2.5, 2.513, 2.513,
2.513, 2.513, 2.513, 2.513, 2.513)), class = "data.frame", row.names = c(NA,
-15L))
df2 <- dput(et)
structure(list(Start_Time = structure(1:5, .Label = c("00:00:01.905",
"00:00:01.914", "00:00:01.925", "00:00:01.930", "00:00:01.991"
), class = "factor"), AOI = structure(c(1L, 2L, 1L, 1L, 2L), .Label = c("Iso",
"Kat"), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
我认为如果我们首先将时间戳转换为毫秒,那么处理数据会更容易:
library(dplyr)
timestamp_to_ms <- function(t) {
sapply(strsplit(as.character(t), ":"), function(x) {
1000 * sum(c(3600, 60, 1) * as.numeric(x))
})
}
df1$ms_time <- timestamp_to_ms(df1$rec_time)
df2$ms_time <- timestamp_to_ms(df2$Start_Time)
既然我们在两个数据帧中都有以毫秒为单位的时间,那么很容易找到最接近的匹配时间戳,并将其记录在df2
:中
df2$ms_time <- sapply(df2$ms_time, function(x) {
df1$ms_time[which.min(abs(x - df1$ms_time))]
})
这样就可以实现标准left_join
:
df3 <- left_join(df1, df2, by = "ms_time") %>%
select(-ms_time)
df3
#> rec_time gsr Start_Time AOI
#> 1 00:00:01.902 2.513 <NA> <NA>
#> 2 00:00:01.905 2.513 00:00:01.905 Iso
#> 3 00:00:01.907 2.518 <NA> <NA>
#> 4 00:00:01.909 2.524 <NA> <NA>
#> 5 00:00:01.913 2.532 00:00:01.914 Kat
#> 6 00:00:01.916 2.499 <NA> <NA>
#> 7 00:00:01.918 2.513 <NA> <NA>
#> 8 00:00:01.922 2.500 <NA> <NA>
#> 9 00:00:01.927 2.513 00:00:01.925 Iso
#> 10 00:00:01.930 2.513 00:00:01.930 Iso
#> 11 00:00:01.989 2.513 00:00:01.991 Kat
#> 12 00:00:01.994 2.513 <NA> <NA>
#> 13 00:00:01.996 2.513 <NA> <NA>
#> 14 00:00:01.999 2.513 <NA> <NA>
#> 15 00:00:02.004 2.513 <NA> <NA>
由reprex包(v0.3.0(创建于2020-09-17
使用@Allan Cameron的建议,可以尝试library(fuzzyjoin)
不同的max_dist
,可以得到不同精度的
difference_left_join(x = df1,
y = df2,
by = c("ms_time"),
max_dist = 2)
rec_time gsr ms_time.x Start_Time AOI ms_time.y
1 00:00:01.902 2.513 1902 <NA> <NA> NA
2 00:00:01.905 2.513 1905 00:00:01.905 Iso 1905
3 00:00:01.907 2.518 1907 00:00:01.905 Iso 1905
4 00:00:01.909 2.524 1909 <NA> <NA> NA
5 00:00:01.913 2.532 1913 00:00:01.914 Kat 1914
6 00:00:01.916 2.499 1916 00:00:01.914 Kat 1914
7 00:00:01.918 2.513 1918 <NA> <NA> NA
8 00:00:01.922 2.500 1922 <NA> <NA> NA
9 00:00:01.927 2.513 1927 00:00:01.925 Iso 1925
10 00:00:01.930 2.513 1930 00:00:01.930 Iso 1930
11 00:00:01.989 2.513 1989 00:00:01.991 Kat 1991
12 00:00:01.994 2.513 1994 <NA> <NA> NA
13 00:00:01.996 2.513 1996 <NA> <NA> NA
14 00:00:01.999 2.513 1999 <NA> <NA> NA
15 00:00:02.004 2.513 2004 <NA> <NA> NA