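In the data frame provided below, I want to add a column containing the difference between each row's start time and the previous row's end time. Since apply does not allow the use of an index, and for loops should be avoided in R, I have run out of ideas for how to build this function. Below is an example of the input and what it should look like at the end:

| start_time | end_time | Waiting_Time |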
| ------------------- | ------------------- | ------------ |
| 1970-01-12 07:24:00 | 1970-01-12 07:24:00 | 0 |
| 1970-01-12 07:24:00 | 1970-01-12 07:30:00 | 0 |
| 1970-01-12 07:34:00 | 1970-01-12 07:47:00 | 4 |
| 1970-01-12 07:45:00 | 1970-01-12 07:45:00 | 15 |
| 1970-01-12 07:47:00 | 1970-01-12 07:52:00 | 2 |
| 1970-01-12 07:58:00 | 1970-01-12 07:58:00 | 6 |
| 1970-01-12 07:58:00 | 1970-01-12 08:12:00 | 0 |
| 1970-01-12 08:12:00 | 1970-01-12 07:30:00 | 0 |
| 1970-01-12 07:24:00 | 1970-01-12 08:20:00 | 72 |
| 1970-01-12 08:26:00 | 1970-01-12 08:26:00 | 6 |
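If the start time is before the end time of the previous row, the function should look two rows back instead (see the examples in rows 4 and 9).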
df <- structure(list(Case_id = c(501L, 501L, 501L, 501L, 501L, 501L,
501L, 501L, 501L, 501L, 501L, 501L, 501L, 501L, 501L), start_time = structure(c(977040,
977040, 978300, 977640, 978420, 979080, 979080, 979920, 980760,
980760, 981360, 982260, 982260, 985200, 985980), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), end_time = structure(c(977040, 977400,
978300, 978420, 978720, 979080, 979920, 980400, 980760, 981360,
981720, 982260, 985200, 985680, 985980), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), Resource_id = c("System", "Tester5",
"System", "SolverC2", "Tester3", "System", "SolverC1", "Tester2",
"System", "SolverC1", "Tester5", "System", "SolverC3", "Tester1",
"System"), Activity_id = c("Register", "Analyze Defect", "Inform User",
"Repair (Complex)", "Test Repair", "Restart Repair", "Repair (Complex)",
"Test Repair", "Restart Repair", "Repair (Complex)", "Test Repair",
"Restart Repair", "Repair (Complex)", "Test Repair", "Archive Repair"
), Log = c("ORIG", "ORIG", "ORIG", "ORIG", "ORIG", "ORIG", "ORIG",
"ORIG", "ORIG", "ORIG", "ORIG", "ORIG", "ORIG", "ORIG", "ORIG"
), ExecTime = structure(c(0, 6, 0, 13, 5, 0, 14, 8, 0, 10, 6,
0, 49, 8, 0), class = "difftime", units = "secs")), row.names = c(4121L,
4122L, 4123L, 4124L, 4125L, 4126L, 4127L, 4129L, 4130L, 4132L,
4133L, 4134L, 4135L, 4136L, 4137L), class = "data.frame")
You can use {dplyr}'s lag() (or lead()) function to access the previous (or following) row.
For example:
library(dplyr)
df %>%
  mutate(delta = start_time - lag(end_time)) %>%
  select(start_time, end_time, delta) # only to shorten the output here; remove in your case
This yields:
start_time end_time delta
4121 1970-01-12 07:24:00 1970-01-12 07:24:00 NA secs
4122 1970-01-12 07:24:00 1970-01-12 07:30:00 0 secs
4123 1970-01-12 07:45:00 1970-01-12 07:45:00 900 secs
4124 1970-01-12 07:34:00 1970-01-12 07:47:00 -660 secs
4125 1970-01-12 07:47:00 1970-01-12 07:52:00 0 secs
4126 1970-01-12 07:58:00 1970-01-12 07:58:00 360 secs
4127 1970-01-12 07:58:00 1970-01-12 08:12:00 0 secs
4129 1970-01-12 08:12:00 1970-01-12 08:20:00 0 secs
4130 1970-01-12 08:26:00 1970-01-12 08:26:00 360 secs
4132 1970-01-12 08:26:00 1970-01-12 08:36:00 0 secs
4133 1970-01-12 08:36:00 1970-01-12 08:42:00 0 secs
4134 1970-01-12 08:51:00 1970-01-12 08:51:00 540 secs
4135 1970-01-12 08:51:00 1970-01-12 09:40:00 0 secs
4136 1970-01-12 09:40:00 1970-01-12 09:48:00 0 secs
4137 1970-01-12 09:53:00 1970-01-12 09:53:00 300 secs
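Obviously, the first start_time has no preceding entry, so the result is NA. You may want to handle this case with a conditional operation or by setting the value to zero.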
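One way to set that leading NA to zero is sketched below. It assumes a small stand-in data frame (`df_demo`, a hypothetical name holding three of the timestamps shown above) and converts the delta to plain minutes before filling the gap with dplyr's `coalesce()`; treat it as an illustration rather than the only way to do this.

```r
library(dplyr)

# Hypothetical three-row stand-in for the real data (times in UTC)
df_demo <- data.frame(
  start_time = as.POSIXct(c("1970-01-12 07:24:00", "1970-01-12 07:24:00",
                            "1970-01-12 07:45:00"), tz = "UTC"),
  end_time   = as.POSIXct(c("1970-01-12 07:24:00", "1970-01-12 07:30:00",
                            "1970-01-12 07:45:00"), tz = "UTC")
)

out <- df_demo %>%
  mutate(
    # difference to the previous row's end_time, as numeric minutes
    delta = as.numeric(difftime(start_time, lag(end_time), units = "mins")),
    # the first row has no predecessor, so replace its NA with 0
    delta = coalesce(delta, 0)
  )

out$delta
```

Converting to a plain numeric first keeps `coalesce()` simple, since both of its arguments are then ordinary doubles.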
For finer control over your time deltas, read up on difftime(…, units = …). There you can set the units to "mins" if that is a more convenient step size for you.
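As a small illustration of the `units` argument, the sketch below compares two made-up timestamps (`t0` and `t1` are hypothetical names, with values borrowed from the example data) in seconds and in minutes, using only base R.

```r
# Two illustrative timestamps in UTC
t0 <- as.POSIXct("1970-01-12 07:30:00", tz = "UTC")
t1 <- as.POSIXct("1970-01-12 07:45:00", tz = "UTC")

# The same interval expressed in different units
d_secs <- difftime(t1, t0, units = "secs")
d_mins <- difftime(t1, t0, units = "mins")

secs_value <- as.numeric(d_secs)  # interval as a plain number of seconds
mins_value <- as.numeric(d_mins)  # interval as a plain number of minutes

# The units of an existing difftime object can also be changed afterwards
units(d_secs) <- "mins"
```

Calling `as.numeric()` drops the difftime class, which is handy when the column should later be summed or compared against plain numbers.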
A base R option using difftime:
df <- transform(df, Waiting_Time = c(0, difftime(start_time[-1],
                                                 end_time[-nrow(df)], units = "mins")))
df$Waiting_Time
#[1] 0 0 15 -11 0 6 0 0 6 0 0 9 0 0 5
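The negative value (-11) is the case the question wants handled by looking two rows back. A possible base R sketch of that fallback follows; it assumes that "two rows back" means the end_time of the row before the previous one, and it uses a hypothetical five-row data frame `df_demo` built from the first rows of the data rather than the full `df`.

```r
# Hypothetical five-row stand-in for the real data (times in UTC)
df_demo <- data.frame(
  start_time = as.POSIXct(c("1970-01-12 07:24:00", "1970-01-12 07:24:00",
                            "1970-01-12 07:45:00", "1970-01-12 07:34:00",
                            "1970-01-12 07:47:00"), tz = "UTC"),
  end_time   = as.POSIXct(c("1970-01-12 07:24:00", "1970-01-12 07:30:00",
                            "1970-01-12 07:45:00", "1970-01-12 07:47:00",
                            "1970-01-12 07:52:00"), tz = "UTC")
)

n <- nrow(df_demo)

# end_time shifted down by one and by two rows; NA where no such row exists
prev1 <- df_demo$end_time[c(NA, seq_len(n - 1))]
prev2 <- df_demo$end_time[c(NA, NA, seq_len(n - 2))]

# Waiting time in minutes relative to one and two rows back
wait1 <- as.numeric(difftime(df_demo$start_time, prev1, units = "mins"))
wait2 <- as.numeric(difftime(df_demo$start_time, prev2, units = "mins"))

# Use the two-rows-back value only where the one-row-back value is negative
waiting <- ifelse(!is.na(wait1) & wait1 < 0, wait2, wait1)

# Rows without a usable predecessor get a waiting time of 0
waiting[is.na(waiting)] <- 0

df_demo$Waiting_Time <- waiting
df_demo$Waiting_Time
```

This stays vectorised, so no explicit for loop is needed. If the two-rows-back value could itself be negative in your data, the rule would need to be extended, which this sketch does not attempt.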