Create a new column based on neighbouring values without using a for loop in R



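In the data frame provided below, I want to add a column holding the difference between each row's start time and the previous row's end time. Since apply does not give access to the row index, and for loops are supposed to be avoided in R, I have run out of ideas on how to build this function. Below is an example of the input and of what it should look like in the end:
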
|      start_time     |      end_time       | Waiting_Time |
| ------------------- | ------------------- | ------------ |
| 1970-01-12 07:24:00 | 1970-01-12 07:24:00 |      0       |
| 1970-01-12 07:24:00 | 1970-01-12 07:30:00 |      0       |
| 1970-01-12 07:34:00 | 1970-01-12 07:47:00 |      4       |
| 1970-01-12 07:45:00 | 1970-01-12 07:45:00 |     15       |
| 1970-01-12 07:47:00 | 1970-01-12 07:52:00 |      2       |
| 1970-01-12 07:58:00 | 1970-01-12 07:58:00 |      6       |
| 1970-01-12 07:58:00 | 1970-01-12 08:12:00 |      0       |
| 1970-01-12 08:12:00 | 1970-01-12 07:30:00 |      0       |
| 1970-01-12 07:24:00 | 1970-01-12 08:20:00 |     72       |
| 1970-01-12 08:26:00 | 1970-01-12 08:26:00 |      6       |

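If the start time is earlier than the previous row's end time, the function should look two rows back instead (see the examples in rows 4 and 9).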

structure(list(Case_id = c(501L, 501L, 501L, 501L, 501L, 501L, 
501L, 501L, 501L, 501L, 501L, 501L, 501L, 501L, 501L), start_time = structure(c(977040, 
977040, 978300, 977640, 978420, 979080, 979080, 979920, 980760, 
980760, 981360, 982260, 982260, 985200, 985980), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), end_time = structure(c(977040, 977400, 
978300, 978420, 978720, 979080, 979920, 980400, 980760, 981360, 
981720, 982260, 985200, 985680, 985980), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), Resource_id = c("System", "Tester5", 
"System", "SolverC2", "Tester3", "System", "SolverC1", "Tester2", 
"System", "SolverC1", "Tester5", "System", "SolverC3", "Tester1", 
"System"), Activity_id = c("Register", "Analyze Defect", "Inform User", 
"Repair (Complex)", "Test Repair", "Restart Repair", "Repair (Complex)", 
"Test Repair", "Restart Repair", "Repair (Complex)", "Test Repair", 
"Restart Repair", "Repair (Complex)", "Test Repair", "Archive Repair"
), Log = c("ORIG", "ORIG", "ORIG", "ORIG", "ORIG", "ORIG", "ORIG", 
"ORIG", "ORIG", "ORIG", "ORIG", "ORIG", "ORIG", "ORIG", "ORIG"
), ExecTime = structure(c(0, 6, 0, 13, 5, 0, 14, 8, 0, 10, 6, 
0, 49, 8, 0), class = "difftime", units = "secs")), row.names = c(4121L, 
4122L, 4123L, 4124L, 4125L, 4126L, 4127L, 4129L, 4130L, 4132L, 
4133L, 4134L, 4135L, 4136L, 4137L), class = "data.frame")

You can use {dplyr}'s lag() (or lead()) function to access the previous (or following) row.

For example:

library(dplyr)

df %>%
  mutate(delta = start_time - lag(end_time)) %>%   # gap to the previous row's end_time
  select(start_time, end_time, delta)              # only shortens the output; drop it in your own code

This yields:

              start_time            end_time     delta
4121 1970-01-12 07:24:00 1970-01-12 07:24:00   NA secs
4122 1970-01-12 07:24:00 1970-01-12 07:30:00    0 secs
4123 1970-01-12 07:45:00 1970-01-12 07:45:00  900 secs
4124 1970-01-12 07:34:00 1970-01-12 07:47:00 -660 secs
4125 1970-01-12 07:47:00 1970-01-12 07:52:00    0 secs
4126 1970-01-12 07:58:00 1970-01-12 07:58:00  360 secs
4127 1970-01-12 07:58:00 1970-01-12 08:12:00    0 secs
4129 1970-01-12 08:12:00 1970-01-12 08:20:00    0 secs
4130 1970-01-12 08:26:00 1970-01-12 08:26:00  360 secs
4132 1970-01-12 08:26:00 1970-01-12 08:36:00    0 secs
4133 1970-01-12 08:36:00 1970-01-12 08:42:00    0 secs
4134 1970-01-12 08:51:00 1970-01-12 08:51:00  540 secs
4135 1970-01-12 08:51:00 1970-01-12 09:40:00    0 secs
4136 1970-01-12 09:40:00 1970-01-12 09:48:00    0 secs
4137 1970-01-12 09:53:00 1970-01-12 09:53:00  300 secs

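Naturally, the first start_time has no previous entry, so the result is NA. You may want to handle this case with a conditional, or simply set the value to zero.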

For finer control over the time delta, read up on difftime(…, units = …). There you can set the units to "mins" if minutes are a more convenient step size for you.
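The two suggestions above (zero for the first row, minutes as the unit) could be combined roughly as follows. This is only a sketch: the small `df` is a stand-in built from the first five rows of the question's data, and using `coalesce()` to replace the NA is one possible choice, not the only one.

```r
library(dplyr)

# Stand-in data: first five rows of the question's start_time / end_time
df <- data.frame(
  start_time = as.POSIXct(c(977040, 977040, 978300, 977640, 978420),
                          origin = "1970-01-01", tz = "UTC"),
  end_time   = as.POSIXct(c(977040, 977400, 978300, 978420, 978720),
                          origin = "1970-01-01", tz = "UTC")
)

res <- df %>%
  mutate(
    # gap to the previous row's end_time, in minutes, as a plain number
    delta = as.numeric(difftime(start_time, lag(end_time), units = "mins")),
    # the first row has no predecessor, so treat its waiting time as 0
    delta = coalesce(delta, 0)
  )

res$delta
#[1]   0   0  15 -11   0
```

Converting to a plain number with `as.numeric()` keeps the column easy to compare and sum; keep it as a difftime if you would rather carry the unit along.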

A base R option using difftime:

df <- transform(df, Waiting_Time = c(0, difftime(start_time[-1] , 
end_time[-nrow(df)], units = "mins")))
df$Waiting_Time
#[1]   0   0  15 -11   0   6   0   0   6   0   0   9   0   0   5
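The negative value above (-11) is the case the question wants handled by looking two rows back. A loop-free way to sketch that in base R is shown below. It rests on assumptions: `shift()` is a hypothetical helper written here, the small `df` is a stand-in built from the first five rows of the question's data, and the rule is read as "if the gap to the previous row is negative, use the gap to the row before that". A negative gap two rows back is not treated specially.

```r
# Stand-in data: first five rows of the question's start_time / end_time
df <- data.frame(
  start_time = as.POSIXct(c(977040, 977040, 978300, 977640, 978420),
                          origin = "1970-01-01", tz = "UTC"),
  end_time   = as.POSIXct(c(977040, 977400, 978300, 978420, 978720),
                          origin = "1970-01-01", tz = "UTC")
)

# Hypothetical helper: move a vector down by k rows, padding the top with NA
shift <- function(x, k) c(rep(NA, k), head(x, -k))

start <- as.numeric(df$start_time)   # seconds since the epoch
end   <- as.numeric(df$end_time)

d1 <- (start - shift(end, 1)) / 60   # minutes since the previous row ended
d2 <- (start - shift(end, 2)) / 60   # minutes since the row before that ended

# First row: 0. Negative gap with a row two back available: use d2. Otherwise: d1.
df$Waiting_Time <- ifelse(is.na(d1), 0,
                          ifelse(d1 < 0 & !is.na(d2), d2, d1))

df$Waiting_Time
#[1]  0  0 15  4  0
```

The 15 and 4 minute values agree with the waiting times the question lists for the 07:45 and 07:34 rows, although the question's table shows those two rows in the opposite order.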
