我有一个车站维修的数据帧。
工作流程是这样的:机械师去一个车站,他们按下一个按钮,记录一个名为release
的action
。修复电台后,他们再次按下按钮,动作现在return
。
您可以在下面看到row 1
和row 2
是一个已完成的任务,花了Jane Jetson
10 秒完成。
dt name foo_id foo_role bikeId station_name station_id action
1 2019-12-12 13:05:47 Jane Jetson 106337 Mechanic 12345 FooStation 1234.89 Release
2 2019-12-12 13:05:57 Jane Jetson 106337 Mechanic 12345 FooStation 1234.89 Return
3 2019-12-12 13:06:16 John Doe 106338 Mechanic 12345 FooStation 1234.89 Release
4 2019-12-12 13:06:19 John Doe 106338 Mechanic 12345 FooStation 1234.89 Return
5 2019-12-12 13:07:16 John Doe 106338 Mechanic 12345 FooStation 1234.89 Release
6 2019-12-12 14:07:16 John Doe 106338 Mechanic 56789 Some Station 4567.12 Release
我想发生什么:
- 我想知道每个
mechanic
使用action
Release
修复车站,然后遵循Return
需要多长时间。 - 如果一个
Release
没有Return
,我想取Sys.time()
并从dt
中减去它。你会看到row 5
和row 6
我这样做了:(我不是 100% 确定我需要上一个操作,但我包括在内以防万一。
library(dplyr)
library(tidyr)
foo = arrange(foo, foo_id, name, foo_role, bikeId, station_id) %>%
group_by(foo_id,name, foo_role, bikeId, station_name,station_id) %>%
mutate(prev_dt = lag(dt, order_by = foo_id),
prev_action = lag(action, order_by=foo_id, default = 'NaN'))
foo$timediffsecs = as.numeric(difftime(foo$dt,foo$prev_dt,units='secs'))
> foo
# A tibble: 6 x 11
# Groups: foo_id, name, foo_role, bikeId, station_name, station_id [3]
dt name foo_id foo_role bikeId station_name station_id action prev_dt prev_action timediffsecs
<dttm> <fct> <int> <fct> <int> <fct> <dbl> <chr> <dttm> <chr> <dbl>
1 2019-12-12 13:05:47 Jane Jetson 106337 Mechanic 12345 FooStation 1235. Release NA NaN NA
2 2019-12-12 13:05:57 Jane Jetson 106337 Mechanic 12345 FooStation 1235. Return 2019-12-12 13:05:47 Release 10
3 2019-12-12 13:06:16 John Doe 106338 Mechanic 12345 FooStation 1235. Release NA NaN NA
4 2019-12-12 13:06:19 John Doe 106338 Mechanic 12345 FooStation 1235. Return 2019-12-12 13:06:16 Release 3
5 2019-12-12 13:07:16 John Doe 106338 Mechanic 12345 FooStation 1235. Release 2019-12-12 13:06:19 Return 57
6 2019-12-12 14:07:16 John Doe 106338 Mechanic 56789 Some Station 4567. Release NA NaN NA
问题:
row 5
是一个新的周期,因为action
Release
和Return
以前发生过,但timediffsecs
记录了57秒。 在row 5
Prev_dt
和prev_action
应该NA
和timediffsecs
=Sys.time() - dt
.row 6
应该有timediffsecs
=Sys.time() - dt
我的想法可以工作:
我将prev_action
NA 更改为 NaN,所以我可以做一些 if else 语句,但我不太确定如何为此构建一个语句。我想将prev_dt
中的 NA 更改为默认为dt
但这样做时出现问题。我想尝试这样做的原因是我可以使用条件语句,但如果不需要,则无需更改 NA。
tl;DR:我希望timediffsecs
记录正确的秒数。row 5
和row 6
有问题。row 5
应该是Sys.time() - dt
.row 6
我想返回Sys.time() - dt
数据:
structure(list(dt = structure(c(1576173947, 1576173957, 1576173976,
1576173979, 1576174036, 1576177636), class = c("POSIXct", "POSIXt"
), tzone = ""), name = structure(c(1L, 1L, 2L, 2L, 2L, 2L), .Label = c("Jane Jetson",
"John Doe"), class = "factor"), foo_id = c(106337L, 106337L,
106338L, 106338L, 106338L, 106338L), foo_role = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "Mechanic", class = "factor"),
bikeId = c(12345L, 12345L, 12345L, 12345L, 12345L, 56789L
), station_name = structure(c(1L, 1L, 1L, 1L, 1L, 2L), .Label = c("FooStation",
"Some Station"), class = "factor"), station_id = c(1234.89,
1234.89, 1234.89, 1234.89, 1234.89, 4567.12), action = c("Release",
"Return", "Release", "Return", "Release", "Release")), row.names = c(NA,
-6L), class = "data.frame")
使用dplyr
的一种方法可能是为每个name
(或foo_id
(以及每次发生"Release"
时创建组。在这个组中,我们计算'Return'
和'Release'
时间之间的差异,如果存在'Return'
或'Release'
时间和当前时间之间的差异。
library(dplyr)
df %>%
group_by(name, group = cumsum(action == "Release")) %>%
mutate(timediffsecs = if (any(action == 'Return'))
dt[action == 'Return'] - dt[action == 'Release'] else Sys.time() - dt,
#If we want to replace Release values with NA
timediffsecs = replace(timediffsecs, n() > 1 & action == 'Release', NA))
# dt name foo_id foo_role bikeId station_name station_id action group timediffsecs
# <dttm> <fct> <int> <fct> <int> <fct> <dbl> <chr> <int> <drtn>
#1 2019-12-13 02:05:47 Jane Jetson 106337 Mechanic 12345 FooStation 1235. Release 1 NA secs
#2 2019-12-13 02:05:57 Jane Jetson 106337 Mechanic 12345 FooStation 1235. Return 1 10 secs
#3 2019-12-13 02:06:16 John Doe 106338 Mechanic 12345 FooStation 1235. Release 2 NA secs
#4 2019-12-13 02:06:19 John Doe 106338 Mechanic 12345 FooStation 1235. Return 2 3 secs
#5 2019-12-13 02:07:16 John Doe 106338 Mechanic 12345 FooStation 1235. Release 3 472603 secs
#6 2019-12-13 03:07:16 John Doe 106338 Mechanic 56789 Some Station 4567. Release 4 469003 secs
虽然这适用于给定的示例,但可能需要根据数据进行一些调整。