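I have a data frame containing the start and end timestamps of events. I want to create a sequence of timestamps that fills in the time between these events. Here is the structure of my data: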
dat <- structure(list(event_id = 1:2,
start_time = structure(c(1617346800,1617348000),
class = c("POSIXct", "POSIXt"), tzone = "UTC"),
end_time = structure(c(1617347400, 1617348300),
class = c("POSIXct", "POSIXt"), tzone = "UTC")),
class = "data.frame", row.names = c(NA, -2L))
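What I am hoping to do is lengthen the data frame so that it has one row per minute, covering both the events themselves and the gap between them. The result would look like this: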
final <- structure(list(event_id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2L, 2L, 2L, 2L, 2L,
2L), start_time = structure(c(1617346800, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1617348000,
NA, NA, NA, NA, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
end_time = structure(c(NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, 1617347400, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, 1617348300), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
record_time = c("2/4/2021 7:00", "2/4/2021 7:01", "2/4/2021 7:02",
"2/4/2021 7:03", "2/4/2021 7:04", "2/4/2021 7:05", "2/4/2021 7:06",
"2/4/2021 7:07", "2/4/2021 7:08", "2/4/2021 7:09", "2/4/2021 7:10",
"2/4/2021 7:11", "2/4/2021 7:12", "2/4/2021 7:13", "2/4/2021 7:14",
"2/4/2021 7:15", "2/4/2021 7:16", "2/4/2021 7:17", "2/4/2021 7:18",
"2/4/2021 7:19", "2/4/2021 7:20", "2/4/2021 7:21", "2/4/2021 7:22",
"2/4/2021 7:23", "2/4/2021 7:24", "2/4/2021 7:25")), class = "data.frame", row.names = c(NA,
-26L))
What I have come up with so far is the complete minute-by-minute sequence:
timeline <- as.POSIXct(
seq.POSIXt(from = min(dat$start_time),
to = max(dat$end_time), by = "min"))
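From here, I am stuck on how to combine the two. Using a for loop, I can make a list of sequences, although at that point I am still not sure how to join the two things together (also, I am sure it could be done better with purrr, but I am not good with purrr yet).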
# Build one minute-by-minute sequence per event
event_timelines <- vector("list", nrow(dat))
for (row in seq_len(nrow(dat))) {
  event_timelines[[row]] <- seq.POSIXt(from = dat$start_time[row],
                                       to = dat$end_time[row],
                                       by = "min")
}
Thanks!
Try this dplyr approach (together with fuzzyjoin).
library(dplyr)
timerange <- range(unlist(dat[,c("start_time", "end_time")]))
attributes(timerange) <- attributes(dat$start_time)
tibble(record_time = seq(timerange[1], timerange[2], by = "min")) %>%
fuzzyjoin::fuzzy_full_join(
dat,
by = c("record_time" = "start_time", "record_time" = "end_time"),
match_fun = list(`>=`, `<=`)
) %>%
group_by(event_id) %>%
mutate(
start_time = start_time[c(1, rep(NA, n()-1))],
end_time = end_time[c(rep(NA, n()-1), 1)]
) %>%
arrange(record_time) %>%
ungroup() %>%
print(n=99)
# # A tibble: 26 x 4
# record_time event_id start_time end_time
# <dttm> <int> <dttm> <dttm>
# 1 2021-04-02 07:00:00 1 2021-04-02 07:00:00 NA
# 2 2021-04-02 07:01:00 1 NA NA
# 3 2021-04-02 07:02:00 1 NA NA
# 4 2021-04-02 07:03:00 1 NA NA
# 5 2021-04-02 07:04:00 1 NA NA
# 6 2021-04-02 07:05:00 1 NA NA
# 7 2021-04-02 07:06:00 1 NA NA
# 8 2021-04-02 07:07:00 1 NA NA
# 9 2021-04-02 07:08:00 1 NA NA
# 10 2021-04-02 07:09:00 1 NA NA
# 11 2021-04-02 07:10:00 1 NA 2021-04-02 07:10:00
# 12 2021-04-02 07:11:00 NA NA NA
# 13 2021-04-02 07:12:00 NA NA NA
# 14 2021-04-02 07:13:00 NA NA NA
# 15 2021-04-02 07:14:00 NA NA NA
# 16 2021-04-02 07:15:00 NA NA NA
# 17 2021-04-02 07:16:00 NA NA NA
# 18 2021-04-02 07:17:00 NA NA NA
# 19 2021-04-02 07:18:00 NA NA NA
# 20 2021-04-02 07:19:00 NA NA NA
# 21 2021-04-02 07:20:00 2 2021-04-02 07:20:00 NA
# 22 2021-04-02 07:21:00 2 NA NA
# 23 2021-04-02 07:22:00 2 NA NA
# 24 2021-04-02 07:23:00 2 NA NA
# 25 2021-04-02 07:24:00 2 NA NA
# 26 2021-04-02 07:25:00 2 NA 2021-04-02 07:25:00
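The reason I chose fuzzyjoin is that POSIXt is numeric-like, with some attributes, and sometimes (due to general computing challenges; see "Why are these numbers not equal?", "Is floating point math broken?", and https://en.wikipedia.org/wiki/IEEE_754) floating-point equality is not guaranteed. In this case, fuzzyjoin is doing a range-based join: a row matches when record_time lies between start_time and end_time, inclusive.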
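To make the range-based idea concrete without the join machinery, here is a minimal base R sketch. It is not the fuzzyjoin implementation; it only illustrates the same `>=` / `<=` test, and the names `record_time` and `hit` are introduced here for illustration. The data frame is rebuilt from the timestamps in the question.

```r
# Floating-point equality can fail even for "obviously equal" values
0.1 + 0.2 == 0.3                   # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))  # TRUE (comparison with a tolerance)

# Same data as in the question
dat <- data.frame(
  event_id   = 1:2,
  start_time = as.POSIXct(c(1617346800, 1617348000), origin = "1970-01-01", tz = "UTC"),
  end_time   = as.POSIXct(c(1617347400, 1617348300), origin = "1970-01-01", tz = "UTC")
)

# Full minute-by-minute timeline
record_time <- seq(min(dat$start_time), max(dat$end_time), by = "min")

# For each timestamp, the row of the event whose [start, end] interval
# contains it, or NA when the timestamp falls in a gap between events
hit <- vapply(seq_along(record_time), function(i) {
  idx <- which(record_time[i] >= dat$start_time & record_time[i] <= dat$end_time)
  if (length(idx) > 0) idx[1] else NA_integer_
}, integer(1))

data.frame(record_time = record_time, event_id = dat$event_id[hit])
```

Because the test is an interval check rather than an exact equality on timestamps, small representation differences in the middle of an interval do not affect which event a timestamp is assigned to.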
An approach using dplyr and tidyr:
Expand the sequence with complete, fill the NA values, and use replace to adjust the output so that it resembles the expected output.
library(dplyr)
library(tidyr)
dat %>%
mutate(record_time = start_time) %>%
complete(record_time = seq(min(start_time), max(end_time), by = '1 min')) %>%
fill(everything()) %>%
mutate(event_id = replace(event_id, record_time > end_time, NA),
across(c(start_time, end_time), ~replace(., . != record_time, NA)))
# record_time event_id start_time end_time
#1 2021-04-02 07:00:00 1 2021-04-02 07:00:00 <NA>
#2 2021-04-02 07:01:00 1 <NA> <NA>
#3 2021-04-02 07:02:00 1 <NA> <NA>
#4 2021-04-02 07:03:00 1 <NA> <NA>
#5 2021-04-02 07:04:00 1 <NA> <NA>
#6 2021-04-02 07:05:00 1 <NA> <NA>
#7 2021-04-02 07:06:00 1 <NA> <NA>
#8 2021-04-02 07:07:00 1 <NA> <NA>
#9 2021-04-02 07:08:00 1 <NA> <NA>
#10 2021-04-02 07:09:00 1 <NA> <NA>
#11 2021-04-02 07:10:00 1 <NA> 2021-04-02 07:10:00
#12 2021-04-02 07:11:00 NA <NA> <NA>
#13 2021-04-02 07:12:00 NA <NA> <NA>
#14 2021-04-02 07:13:00 NA <NA> <NA>
#15 2021-04-02 07:14:00 NA <NA> <NA>
#16 2021-04-02 07:15:00 NA <NA> <NA>
#17 2021-04-02 07:16:00 NA <NA> <NA>
#18 2021-04-02 07:17:00 NA <NA> <NA>
#19 2021-04-02 07:18:00 NA <NA> <NA>
#20 2021-04-02 07:19:00 NA <NA> <NA>
#21 2021-04-02 07:20:00 2 2021-04-02 07:20:00 <NA>
#22 2021-04-02 07:21:00 2 <NA> <NA>
#23 2021-04-02 07:22:00 2 <NA> <NA>
#24 2021-04-02 07:23:00 2 <NA> <NA>
#25 2021-04-02 07:24:00 2 <NA> <NA>
#26 2021-04-02 07:25:00 2 <NA> 2021-04-02 07:25:00
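For readers unfamiliar with the two tidyr verbs used above, here is a small sketch on hypothetical toy data (the `toy` data frame and its columns `x` and `y` are made up for illustration). It shows what complete and fill do on their own, before any replace step masks values again.

```r
library(tidyr)

# Hypothetical toy data: observations exist only at x = 1 and x = 4
toy <- data.frame(x = c(1L, 4L), y = c("a", "b"), stringsAsFactors = FALSE)

# complete() adds a row for every missing x in 1:5, leaving y as NA there
expanded <- complete(toy, x = 1:5)

# fill() carries the last non-missing y downwards into those NA rows
filled <- fill(expanded, y)

expanded$y
filled$y
```

In the answer above, the same carry-forward is what assigns event_id, start_time, and end_time to every minute, which is why the final mutate has to blank out the rows that lie after an event's end_time.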
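You can stick with your seq.POSIXt approach. Use by to follow a split-apply-combine approach: for each event_id, compute the sequence. Triplicate the result and, using match, set the extra values in the copies of the old columns to NA.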
res <- do.call(rbind, by(dat, dat$event_id, function(x) {
s1 <- s2 <- s3 <- do.call(seq, c(as.list(unname(x[-1])), "min"))
s1[-match(x[2], s1)] <- NA
s2[-match(x[3], s2)] <- NA
data.frame(event_id=as.integer(x[1]), start_time=s1, end_time=s2, record_time=s3)
}))
res
# event_id start_time end_time record_time
# 1.1 1 2021-04-02 07:00:00 <NA> 2021-04-02 07:00:00
# 1.2 1 <NA> <NA> 2021-04-02 07:01:00
# 1.3 1 <NA> <NA> 2021-04-02 07:02:00
# 1.4 1 <NA> <NA> 2021-04-02 07:03:00
# 1.5 1 <NA> <NA> 2021-04-02 07:04:00
# 1.6 1 <NA> <NA> 2021-04-02 07:05:00
# 1.7 1 <NA> <NA> 2021-04-02 07:06:00
# 1.8 1 <NA> <NA> 2021-04-02 07:07:00
# 1.9 1 <NA> <NA> 2021-04-02 07:08:00
# 1.10 1 <NA> <NA> 2021-04-02 07:09:00
# 1.11 1 <NA> 2021-04-02 07:10:00 2021-04-02 07:10:00
# 2.1 2 2021-04-02 07:20:00 <NA> 2021-04-02 07:20:00
# 2.2 2 <NA> <NA> 2021-04-02 07:21:00
# 2.3 2 <NA> <NA> 2021-04-02 07:22:00
# 2.4 2 <NA> <NA> 2021-04-02 07:23:00
# 2.5 2 <NA> <NA> 2021-04-02 07:24:00
# 2.6 2 <NA> 2021-04-02 07:25:00 2021-04-02 07:25:00
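Note that the result above contains only the minutes inside each event, whereas the desired output in the question also has rows for the gap between events. One possible way to add them, sketched here under the assumption that a left join onto the full timeline is acceptable, is base R's merge with all.x = TRUE. The names `events`, `timeline`, and `full` are introduced for this illustration, and the sketch keeps only event_id and record_time for brevity.

```r
# Same data as in the question
dat <- data.frame(
  event_id   = 1:2,
  start_time = as.POSIXct(c(1617346800, 1617348000), origin = "1970-01-01", tz = "UTC"),
  end_time   = as.POSIXct(c(1617347400, 1617348300), origin = "1970-01-01", tz = "UTC")
)

# Split-apply-combine: one block of per-minute rows for each event
events <- do.call(rbind, lapply(seq_len(nrow(dat)), function(i) {
  data.frame(
    event_id    = dat$event_id[i],
    record_time = seq(dat$start_time[i], dat$end_time[i], by = "min")
  )
}))

# Every minute from the first start to the last end
timeline <- data.frame(
  record_time = seq(min(dat$start_time), max(dat$end_time), by = "min")
)

# Left join: minutes that belong to no event get event_id = NA
full <- merge(timeline, events, by = "record_time", all.x = TRUE)
full
```

The start_time and end_time columns could then be added back with the same match-based masking used in the answer.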