I have a data frame called ActiveData that shows the dates and times (StartDT and EndDT) during which each ID was active. ActiveData is structured like this, where StartDT and EndDT are in mdy_hms format:
ID StartDT EndDT
1 05/05/2021 8:15:00 05/05/2021 9:15:00
2 05/05/2021 8:15:00 05/05/2021 9:15:00
3 05/05/2021 8:15:00 05/05/2021 10:15:00
…
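I have another data frame called Observations that records each time an ID observed itself or another ID meeting a certain variable. Here, ID is the observer and IDobserved is the ID that was observed meeting the variable (an ID can also observe itself).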
ID DT IDobserved
1 05/05/2021 8:19:00 1
1 05/05/2021 8:20:00 1
1 05/05/2021 8:19:00 2
2 05/05/2021 8:19:20 1
2 05/05/2021 8:19:45 3
3 05/05/2021 8:19:00 1
3 05/05/2021 8:20:00 1
3 05/05/2021 8:25:00 1
3 05/05/2021 8:45:00 3
3 05/05/2021 8:19:00 2
…
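I want to summarize the number of times each ID observed other IDs (including itself) meeting the variable within the time window given by StartDT and EndDT in the ActiveData data frame. The final table should give the number of observations together with the time elapsed, in seconds, between the bounds during which each ID was actively observing (that is, between StartDT and EndDT in ActiveData). For the data above, the final table would look like this: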
ID IDobserved Observations TimeElapsed
1 1 2 3600
1 2 1 3600
2 1 1 3600
2 3 1 3600
3 1 3 7200
3 2 1 7200
3 3 1 7200
How can I do this?
Cool question! Using your data:
ActiveData <- tibble::tribble(
~ID, ~StartDT, ~EndDT,
1, "05/05/2021 8:15:00", "05/05/2021 9:15:00",
2, "05/05/2021 8:15:00", "05/05/2021 9:15:00",
3, "05/05/2021 8:15:00", "05/05/2021 10:15:00"
)
Observations <- tibble::tribble(
~ID, ~DT, ~IDobserved,
1, "05/05/2021 8:19:00", 1,
1, "05/05/2021 8:20:00", 1,
1, "05/05/2021 8:19:00", 2,
2, "05/05/2021 8:19:20", 1,
2, "05/05/2021 8:19:45", 3,
3, "05/05/2021 8:19:00", 1,
3, "05/05/2021 8:20:00", 1,
3, "05/05/2021 8:25:00", 1,
3, "05/05/2021 8:45:00", 3,
3, "05/05/2021 8:19:00", 2
)
I would do this:
library(dplyr)

# Month/day/year, matching the mdy_hms format stated in the question
fmt <- "%m/%d/%Y %H:%M:%S"

ActiveData %>%
  mutate(across(-ID, ~ as.POSIXct(., format = fmt))) %>%
  # For each row of ActiveData, count that ID's observations inside its window
  purrr::pmap(function(...) {
    args <- list(...)
    Observations %>%
      mutate(DT = as.POSIXct(DT, format = fmt)) %>%
      filter(DT >= args$StartDT, DT <= args$EndDT, ID == args$ID) %>%
      count(ID, IDobserved, name = "Observations") %>%
      mutate(TimeElapsed = difftime(args$EndDT,
                                    args$StartDT,
                                    units = "secs"))
  }) %>%
  bind_rows()
which returns
# A tibble: 7 x 4
ID IDobserved Observations TimeElapsed
<dbl> <dbl> <int> <drtn>
1 1 1 2 3600 secs
2 1 2 1 3600 secs
3 2 1 1 3600 secs
4 2 3 1 3600 secs
5 3 1 3 7200 secs
6 3 2 1 7200 secs
7 3 3 1 7200 secs
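The per-row pmap call above can also be expressed as a single join followed by a filter. The following is only a sketch of that alternative, assuming each ID has exactly one row in ActiveData (as in the sample data); the name result and the explicit tz = "UTC" are choices made here for illustration and are not part of the original answer.

```r
library(dplyr)

# Sample data copied from the question
ActiveData <- tibble::tribble(
  ~ID, ~StartDT,             ~EndDT,
  1,   "05/05/2021 8:15:00", "05/05/2021 9:15:00",
  2,   "05/05/2021 8:15:00", "05/05/2021 9:15:00",
  3,   "05/05/2021 8:15:00", "05/05/2021 10:15:00"
)
Observations <- tibble::tribble(
  ~ID, ~DT,                  ~IDobserved,
  1,   "05/05/2021 8:19:00", 1,
  1,   "05/05/2021 8:20:00", 1,
  1,   "05/05/2021 8:19:00", 2,
  2,   "05/05/2021 8:19:20", 1,
  2,   "05/05/2021 8:19:45", 3,
  3,   "05/05/2021 8:19:00", 1,
  3,   "05/05/2021 8:20:00", 1,
  3,   "05/05/2021 8:25:00", 1,
  3,   "05/05/2021 8:45:00", 3,
  3,   "05/05/2021 8:19:00", 2
)

# Month/day/year, per the question's mdy_hms note
fmt <- "%m/%d/%Y %H:%M:%S"

result <- Observations %>%
  # Attach each observer's active window to its observations
  inner_join(ActiveData, by = "ID") %>%
  # tz = "UTC" is an assumption; the question does not state a time zone
  mutate(across(c(DT, StartDT, EndDT),
                ~ as.POSIXct(.x, format = fmt, tz = "UTC"))) %>%
  # Keep only observations made inside the active window
  filter(DT >= StartDT, DT <= EndDT) %>%
  mutate(TimeElapsed = as.numeric(difftime(EndDT, StartDT, units = "secs"))) %>%
  count(ID, IDobserved, TimeElapsed, name = "Observations") %>%
  select(ID, IDobserved, Observations, TimeElapsed)

print(result)
```

If an ID could have several active windows, the join would match every observation against every window for that ID, so the grouping would probably need to include StartDT and EndDT as well.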
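Here is an approach using data.table.
- Convert the data.frame objects to data.table with setDT
- Convert the "DT", "StartDT", and "EndDT" columns in both datasets to date-time class (mdy_hms)
- Create the "Observations" column in the second dataset (df2) as the number of observations per group of ID, IDobserved
- Join with the first dataset on the ID column, compute TimeElapsed as the sum of the differences between the DT, StartDT, and EndDT columns using difftime, and return the unique rows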
library(data.table)
library(lubridate)
setDT(df1)[, c('StartDT', 'EndDT') := lapply(.SD, mdy_hms),
.SDcols = 2:3]
setDT(df2)[, DT := mdy_hms(DT)]
df2[, Observations := .N, .(ID, IDobserved)]
unique(df2[df1, .(IDobserved, Observations,
TimeElapsed = as.numeric(difftime(DT, StartDT, units = 'sec') +
difftime(EndDT, DT, units = 'sec'))), on = .(ID), by = .EACHI])
-output
ID IDobserved Observations TimeElapsed
1: 1 1 2 3600
2: 1 2 1 3600
3: 2 1 1 3600
4: 2 3 1 3600
5: 3 1 3 7200
6: 3 3 1 7200
7: 3 2 1 7200
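One thing worth noting about the data.table code above: it counts every row per (ID, IDobserved) pair and never compares DT against StartDT and EndDT, and its TimeElapsed expression, (DT - StartDT) + (EndDT - DT), simplifies to EndDT - StartDT. For the sample data this makes no difference, because every observation falls inside its window. A minimal sketch of a variant that does enforce the window, by joining on ID and then filtering, might look like this (the names joined, in_window, and res are hypothetical and not part of the original answer):

```r
library(data.table)
library(lubridate)

# Sample data copied from the answer's data section, parsed with mdy_hms
df1 <- data.table(
  ID = 1:3,
  StartDT = mdy_hms(c("05/05/2021 8:15:00", "05/05/2021 8:15:00",
                      "05/05/2021 8:15:00")),
  EndDT = mdy_hms(c("05/05/2021 9:15:00", "05/05/2021 9:15:00",
                    "05/05/2021 10:15:00"))
)
df2 <- data.table(
  ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L),
  DT = mdy_hms(c("05/05/2021 8:19:00", "05/05/2021 8:20:00",
                 "05/05/2021 8:19:00", "05/05/2021 8:19:20",
                 "05/05/2021 8:19:45", "05/05/2021 8:19:00",
                 "05/05/2021 8:20:00", "05/05/2021 8:25:00",
                 "05/05/2021 8:45:00", "05/05/2021 8:19:00")),
  IDobserved = c(1L, 1L, 2L, 1L, 3L, 1L, 1L, 1L, 3L, 2L)
)

# Attach each observer's window to its observations (equi-join on ID)
joined <- df2[df1, on = .(ID)]

# Keep only observations that fall inside the active window
in_window <- joined[DT >= StartDT & DT <= EndDT]

# Window length in seconds
in_window[, TimeElapsed := as.numeric(difftime(EndDT, StartDT, units = "secs"))]

# Count observations per observer/observed pair
res <- in_window[, .(Observations = .N), by = .(ID, IDobserved, TimeElapsed)]
setcolorder(res, c("ID", "IDobserved", "Observations", "TimeElapsed"))
setorder(res, ID, IDobserved)

print(res)
```

A non-equi join on the window bounds would be another way to apply the same restriction; the join-then-filter form is used here only because its column naming is easier to follow.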
data
df1 <- structure(list(ID = 1:3, StartDT = c("05/05/2021 8:15:00",
"05/05/2021 8:15:00",
"05/05/2021 8:15:00"), EndDT = c("05/05/2021 9:15:00", "05/05/2021 9:15:00",
"05/05/2021 10:15:00")), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L),
DT = c("05/05/2021 8:19:00", "05/05/2021 8:20:00", "05/05/2021 8:19:00",
"05/05/2021 8:19:20", "05/05/2021 8:19:45", "05/05/2021 8:19:00",
"05/05/2021 8:20:00", "05/05/2021 8:25:00", "05/05/2021 8:45:00",
"05/05/2021 8:19:00"), IDobserved = c(1L, 1L, 2L, 1L, 3L,
1L, 1L, 1L, 3L, 2L)), class = "data.frame", row.names = c(NA,
-10L))