在我的报告中,每一行都有一个ID,一个状态和一个活动日期时间;由此,我需要计算出一个ID的两个相关状态之间的持续时间。例如,数据可能如下所示:
ID | status | activitydatetime
123456 | Account Ready | 04-25-2022 06:30
123456 | In Progress | 04-25-2022 06:30
123456 | Additional Documentation Needed | 04-25-2022 06:48
123456 | Completed | 04-25-2022 06:52
234567| In Progress | 04-23-2022 08:25
234567| Query Pending | 04-23-2022 08:30
234567 | In Progress | 04-25-2022 13:00
234567| Completed | 04-25-2022 13:17
从这些数据中,我需要找到In Progress和Completed之间的持续时间。我用R来做这个,但我不知道最好的方法。最大的障碍是编写一个脚本,该脚本将根据每个状态的帐户编号运行;例如,对于123456,datediff(status = In Progress, status = Completed)"然后继续到下一个帐户num -但是for循环感觉不是最干净的方法。
我已经设法让这个工作:
library(dplyr)
# duration of all activity per account
activity <- df %>%
group_by(ID) %>%
summarize(
first_activity = min(activitydatetime),
last_activity = max(activitydatetime),
activity = last_activity - first_activity)
问题是报告并不总是以status = In Progress开始,所以我仍然需要能够调整我的脚本来查看状态…例如,first_activity =第一次的activitydatetime状态= In Progress ID 123456
不幸的是,执行以下操作会得到整个报告的第一个和最后一个活动日期时间。
activity <- df %>%
group_by(ID) %>%
summarize(
first_activity = df[first(which(df$status %in% "In Progress")),]$activitydatetime,
last_activity = df[last(which(df$status %in% "Completed")),]$activitydatetime,
activity = last_activity - first_activity)
谢谢你看一看!
一种选择是先将filter
放到所需的行中。我也用slice
取In Progress
的最早实例。我这样做是因为ID = 234567
有2个In Progress
实例。然后,我们可以使用diff
进行汇总,以获得两行之间的差异。然后,我添加了一个mutate
语句,以便可以更改单位。
library(tidyverse)
df %>%
mutate(activity_datetime = strptime(activity_datetime, '%m-%d-%Y %H:%M')) %>%
arrange(ID, status, activity_datetime) %>%
group_by(ID, status) %>%
filter(status %in% c("In Progress", "Completed")) %>%
slice(1) %>%
group_by(ID) %>%
summarize(activity = abs(diff(activity_datetime))) %>%
mutate(activity = as.numeric(activity, units = 'mins'))
ID activity
<int> <drtn>
1 123456 22 mins
2 234567 3172 mins
您可以将完成的时间相加,然后取差值。然后是一个分组过滤器,以获得每个ID的时间差异最大的行。
library(tidyverse)
# data:
d <- structure(list(ID = c(123456, 123456, 123456, 123456, 234567, 234567, 234567, 234567), status = c("Account Ready", "In Progress", "Additional Documentation Needed", "Completed", "In Progress", "Query Pending", "In Progress", "Completed"), datetime = structure(c(1650868200, 1650868200, 1650869280, 1650869520, 1650702300, 1650702600, 1650891600, 1650892620), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"))
d |>
filter(status == "Completed") |>
rename(completed = datetime) |>
select(-status) |>
right_join(d) |>
mutate(diff = completed - datetime) |>
group_by(ID) |>
slice(which.max(diff))
#> Joining, by = "ID"
#> # A tibble: 2 × 5
#> # Groups: ID [2]
#> ID completed status datetime diff
#> <dbl> <dttm> <chr> <dttm> <drtn>
#> 1 123456 2022-04-25 06:52:00 Account Ready 2022-04-25 06:30:00 1320 secs
#> 2 234567 2022-04-25 13:17:00 In Progress 2022-04-23 08:25:00 190320 secs
由reprex包(v2.0.1)于2022-04-26创建