r - Filling in dates between two event timestamps formatted as POSIXct



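I have a data frame containing start and end timestamps for events. I'd like to create a sequence of timestamps that fills in the time between these events. Here is my data structure: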

dat <- structure(list(event_id = 1:2,
start_time = structure(c(1617346800,1617348000),
class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
end_time = structure(c(1617347400, 1617348300),
class = c("POSIXct", "POSIXt"), tzone = "UTC")),
class = "data.frame", row.names = c(NA, -2L))

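What I'd like to do is lengthen the data frame so that it has one row for every minute between the start of the first event and the end of the last one, including the minutes that fall in the gap between events. So you would end up with something like this: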

final <- structure(list(event_id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2L, 2L, 2L, 2L, 2L, 
2L), start_time = structure(c(1617346800, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1617348000, 
NA, NA, NA, NA, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
end_time = structure(c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, 1617347400, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, 1617348300), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
record_time = c("2/4/2021 7:00", "2/4/2021 7:01", "2/4/2021 7:02", 
"2/4/2021 7:03", "2/4/2021 7:04", "2/4/2021 7:05", "2/4/2021 7:06", 
"2/4/2021 7:07", "2/4/2021 7:08", "2/4/2021 7:09", "2/4/2021 7:10", 
"2/4/2021 7:11", "2/4/2021 7:12", "2/4/2021 7:13", "2/4/2021 7:14", 
"2/4/2021 7:15", "2/4/2021 7:16", "2/4/2021 7:17", "2/4/2021 7:18", 
"2/4/2021 7:19", "2/4/2021 7:20", "2/4/2021 7:21", "2/4/2021 7:22", 
"2/4/2021 7:23", "2/4/2021 7:24", "2/4/2021 7:25")), class = "data.frame", row.names = c(NA, 
-26L))

What I've come up with so far is to create the complete minute-by-minute sequence:

timeline <- as.POSIXct(
  seq.POSIXt(from = min(dat$start_time),
             to = max(dat$end_time), by = "min"))

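From here, I'm stuck on how to combine the two. Using a for loop, I can make a list of sequences, although at that point I'm still not sure how to bring the two things together (also, I'm sure it could be done better with purrr, but I'm not good with purrr yet).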

event_timelines <- list()
# seq_len() is safe even when dat has zero rows (1:nrow(dat) would give 1:0)
for (row in seq_len(nrow(dat))) {
  event_timelines[[row]] <- seq.POSIXt(from = dat[row,]$start_time,
                                       to = dat[row,]$end_time,
                                       by = "min")
}

Thanks!

Try this dplyr approach (together with fuzzyjoin).

library(dplyr)
# unlist() drops the POSIXct class, so restore the attributes afterwards
timerange <- range(unlist(dat[,c("start_time", "end_time")]))
attributes(timerange) <- attributes(dat$start_time)
tibble(record_time = seq(timerange[1], timerange[2], by = "min")) %>%
  # range join: start_time <= record_time <= end_time
  fuzzyjoin::fuzzy_full_join(
    dat,
    by = c("record_time" = "start_time", "record_time" = "end_time"),
    match_fun = list(`>=`, `<=`)
  ) %>%
  group_by(event_id) %>%
  # keep start_time only on the first row and end_time only on the last row of each event
  mutate(
    start_time = start_time[c(1, rep(NA, n()-1))],
    end_time = end_time[c(rep(NA, n()-1), 1)]
  ) %>%
  arrange(record_time) %>%
  ungroup() %>%
  print(n=99)
# # A tibble: 26 x 4
#    record_time         event_id start_time          end_time           
#    <dttm>                 <int> <dttm>              <dttm>             
#  1 2021-04-02 07:00:00        1 2021-04-02 07:00:00 NA                 
#  2 2021-04-02 07:01:00        1 NA                  NA                 
#  3 2021-04-02 07:02:00        1 NA                  NA                 
#  4 2021-04-02 07:03:00        1 NA                  NA                 
#  5 2021-04-02 07:04:00        1 NA                  NA                 
#  6 2021-04-02 07:05:00        1 NA                  NA                 
#  7 2021-04-02 07:06:00        1 NA                  NA                 
#  8 2021-04-02 07:07:00        1 NA                  NA                 
#  9 2021-04-02 07:08:00        1 NA                  NA                 
# 10 2021-04-02 07:09:00        1 NA                  NA                 
# 11 2021-04-02 07:10:00        1 NA                  2021-04-02 07:10:00
# 12 2021-04-02 07:11:00       NA NA                  NA                 
# 13 2021-04-02 07:12:00       NA NA                  NA                 
# 14 2021-04-02 07:13:00       NA NA                  NA                 
# 15 2021-04-02 07:14:00       NA NA                  NA                 
# 16 2021-04-02 07:15:00       NA NA                  NA                 
# 17 2021-04-02 07:16:00       NA NA                  NA                 
# 18 2021-04-02 07:17:00       NA NA                  NA                 
# 19 2021-04-02 07:18:00       NA NA                  NA                 
# 20 2021-04-02 07:19:00       NA NA                  NA                 
# 21 2021-04-02 07:20:00        2 2021-04-02 07:20:00 NA                 
# 22 2021-04-02 07:21:00        2 NA                  NA                 
# 23 2021-04-02 07:22:00        2 NA                  NA                 
# 24 2021-04-02 07:23:00        2 NA                  NA                 
# 25 2021-04-02 07:24:00        2 NA                  NA                 
# 26 2021-04-02 07:25:00        2 NA                  2021-04-02 07:25:00

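The reason I chose fuzzyjoin is that POSIXt is numeric-like, with some attributes, and sometimes "floating-point equality" is not guaranteed (because of general computing challenges; see "Why are these numbers not equal?", "Is floating point math broken?", and https://en.wikipedia.org/wiki/IEEE_754). In this case, fuzzyjoin does a range-based join, effectively testing whether record_time lies between start_time and end_time.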
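To make the floating-point point concrete, here is a minimal base R sketch. The values `0.1 + 0.2` and the 15-minute `record_time` vector are made up for illustration and are not part of the original data; the second half only shows the kind of range test that the join's `match_fun` pair expresses, not what fuzzyjoin does internally.

```r
# Exact equality can fail for values that "should" be equal.
x <- 0.1 + 0.2
x == 0.3                    # FALSE with IEEE 754 doubles
isTRUE(all.equal(x, 0.3))   # TRUE: comparison within a tolerance

# A range test sidesteps exact equality:
# is each record_time inside [start_time, end_time]?
start_time <- as.POSIXct("2021-04-02 07:00:00", tz = "UTC")
end_time <- as.POSIXct("2021-04-02 07:10:00", tz = "UTC")
record_time <- seq(start_time, by = "min", length.out = 15)  # 07:00 .. 07:14

in_event <- record_time >= start_time & record_time <= end_time
sum(in_event)               # 11 minutes fall inside the event
```

The two comparisons in `in_event` correspond to the `>=` and `<=` functions passed as `match_fun` in the join above.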

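An approach using dplyr and tidyr:

Use complete and fill to expand the sequence and carry each event's values down, then use replace to set values to NA so that the output resembles the expected output.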

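library(dplyr)
library(tidyr)
dat %>%
  mutate(record_time = start_time) %>%
  # add one row per minute across the whole time range
  complete(record_time = seq(min(start_time), max(end_time), by = '1 min')) %>%
  # carry each event's values down into the new rows
  fill(everything()) %>%
  # blank out the event id in the gaps, and keep start/end only on their own minute
  mutate(event_id = replace(event_id, record_time >  end_time, NA),
         across(c(start_time, end_time), ~replace(., . != record_time, NA)))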

#           record_time event_id          start_time            end_time
#1  2021-04-02 07:00:00        1 2021-04-02 07:00:00                <NA>
#2  2021-04-02 07:01:00        1                <NA>                <NA>
#3  2021-04-02 07:02:00        1                <NA>                <NA>
#4  2021-04-02 07:03:00        1                <NA>                <NA>
#5  2021-04-02 07:04:00        1                <NA>                <NA>
#6  2021-04-02 07:05:00        1                <NA>                <NA>
#7  2021-04-02 07:06:00        1                <NA>                <NA>
#8  2021-04-02 07:07:00        1                <NA>                <NA>
#9  2021-04-02 07:08:00        1                <NA>                <NA>
#10 2021-04-02 07:09:00        1                <NA>                <NA>
#11 2021-04-02 07:10:00        1                <NA> 2021-04-02 07:10:00
#12 2021-04-02 07:11:00       NA                <NA>                <NA>
#13 2021-04-02 07:12:00       NA                <NA>                <NA>
#14 2021-04-02 07:13:00       NA                <NA>                <NA>
#15 2021-04-02 07:14:00       NA                <NA>                <NA>
#16 2021-04-02 07:15:00       NA                <NA>                <NA>
#17 2021-04-02 07:16:00       NA                <NA>                <NA>
#18 2021-04-02 07:17:00       NA                <NA>                <NA>
#19 2021-04-02 07:18:00       NA                <NA>                <NA>
#20 2021-04-02 07:19:00       NA                <NA>                <NA>
#21 2021-04-02 07:20:00        2 2021-04-02 07:20:00                <NA>
#22 2021-04-02 07:21:00        2                <NA>                <NA>
#23 2021-04-02 07:22:00        2                <NA>                <NA>
#24 2021-04-02 07:23:00        2                <NA>                <NA>
#25 2021-04-02 07:24:00        2                <NA>                <NA>
#26 2021-04-02 07:25:00        2                <NA> 2021-04-02 07:25:00

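You can stick with your seq.POSIXt approach. Use by to follow a split-apply-combine approach: for each event id, compute the sequence, triplicate the result, and use match to set the additional values in the old columns to NA. Note that this result contains only the minutes inside each event, not the gap between events.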

res <- do.call(rbind, by(dat, dat$event_id, function(x) {
  # one minute-by-minute sequence per event, copied three times
  s1 <- s2 <- s3 <- do.call(seq, c(as.list(unname(x[-1])), "min"))
  s1[-match(x[2], s1)] <- NA  # keep only the start time
  s2[-match(x[3], s2)] <- NA  # keep only the end time
  data.frame(event_id=as.integer(x[1]), start_time=s1, end_time=s2, record_time=s3)
}))
res
#      event_id          start_time            end_time         record_time
# 1.1         1 2021-04-02 07:00:00                <NA> 2021-04-02 07:00:00
# 1.2         1                <NA>                <NA> 2021-04-02 07:01:00
# 1.3         1                <NA>                <NA> 2021-04-02 07:02:00
# 1.4         1                <NA>                <NA> 2021-04-02 07:03:00
# 1.5         1                <NA>                <NA> 2021-04-02 07:04:00
# 1.6         1                <NA>                <NA> 2021-04-02 07:05:00
# 1.7         1                <NA>                <NA> 2021-04-02 07:06:00
# 1.8         1                <NA>                <NA> 2021-04-02 07:07:00
# 1.9         1                <NA>                <NA> 2021-04-02 07:08:00
# 1.10        1                <NA>                <NA> 2021-04-02 07:09:00
# 1.11        1                <NA> 2021-04-02 07:10:00 2021-04-02 07:10:00
# 2.1         2 2021-04-02 07:20:00                <NA> 2021-04-02 07:20:00
# 2.2         2                <NA>                <NA> 2021-04-02 07:21:00
# 2.3         2                <NA>                <NA> 2021-04-02 07:22:00
# 2.4         2                <NA>                <NA> 2021-04-02 07:23:00
# 2.5         2                <NA>                <NA> 2021-04-02 07:24:00
# 2.6         2                <NA> 2021-04-02 07:25:00 2021-04-02 07:25:00
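The output above has 17 rows because it covers only the minutes inside each event. If the gap minutes are also wanted, as in the expected output of the question, one possible base R sketch is to left-join the per-event rows onto the full timeline. The names `per_event`, `long`, and `filled` are hypothetical, and `dat` is rebuilt here from the printed timestamps so that the block runs on its own.

```r
# Same two events as in the question, rebuilt from the printed UTC times.
dat <- data.frame(
  event_id = 1:2,
  start_time = as.POSIXct(c("2021-04-02 07:00:00", "2021-04-02 07:20:00"), tz = "UTC"),
  end_time = as.POSIXct(c("2021-04-02 07:10:00", "2021-04-02 07:25:00"), tz = "UTC")
)

# One minute-by-minute block per event, stacked into a long data frame.
per_event <- lapply(seq_len(nrow(dat)), function(i) {
  data.frame(
    event_id = dat$event_id[i],
    record_time = seq(dat$start_time[i], dat$end_time[i], by = "min")
  )
})
long <- do.call(rbind, per_event)

# Full timeline from the first start to the last end, as in the question.
timeline <- data.frame(
  record_time = seq(min(dat$start_time), max(dat$end_time), by = "min")
)

# Left join: minutes that belong to no event get event_id = NA.
filled <- merge(timeline, long, by = "record_time", all.x = TRUE)
filled <- filled[order(filled$record_time), ]

nrow(filled)                    # 26 rows, one per minute
sum(is.na(filled$event_id))     # 9 gap minutes (07:11 to 07:19)
```

This sketch assumes events do not overlap; with overlapping events a minute would appear once per matching event.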
