在R中,是否有一种方法可以根据应用于其他列的公共值和标准获得两行之间的差异?



在我的报告中,每一行都有一个ID,一个状态和一个活动日期时间;由此,我需要计算出一个ID的两个相关状态之间的持续时间。例如,数据可能如下所示:

ID | status | activitydatetime
123456 | Account Ready | 04-25-2022 06:30
123456 | In Progress | 04-25-2022 06:30
123456 | Additional Documentation Needed | 04-25-2022 06:48
123456 | Completed | 04-25-2022 06:52
234567| In Progress | 04-23-2022 08:25
234567| Query Pending | 04-23-2022 08:30
234567 | In Progress | 04-25-2022 13:00
234567| Completed | 04-25-2022 13:17

从这些数据中,我需要找到In Progress和Completed之间的持续时间。我用R来做这个,但我不知道最好的方法。最大的障碍是编写一个脚本,该脚本将根据每个状态的帐户编号运行;例如,对于123456,datediff(status = In Progress, status = Completed)"然后继续到下一个帐户num -但是for循环感觉不是最干净的方法。

我已经设法让这个工作:

library(dplyr)
# duration of all activity per account
activity <- df %>%
group_by(ID) %>%
summarize(
first_activity = min(activitydatetime),
last_activity = max(activitydatetime),
activity = last_activity - first_activity)

问题是报告并不总是以status = In Progress开始,所以我仍然需要能够调整我的脚本来查看状态…例如,first_activity =第一次的activitydatetime状态= In Progress ID 123456

不幸的是,执行以下操作会得到整个报告的第一个和最后一个活动日期时间。

activity <- df %>%
group_by(ID) %>%
summarize(
first_activity = df[first(which(df$status %in% "In Progress")),]$activitydatetime,
last_activity = df[last(which(df$status %in% "Completed")),]$activitydatetime,
activity = last_activity - first_activity)

谢谢你看一看!

一种选择是先将filter放到所需的行中。我也用sliceIn Progress的最早实例。我这样做是因为ID = 234567有2个In Progress实例。然后,我们可以使用diff进行汇总,以获得两行之间的差异。然后,我添加了一个mutate语句,以便可以更改单位。

library(tidyverse)
df %>%
mutate(activity_datetime = strptime(activity_datetime, '%m-%d-%Y %H:%M')) %>%
arrange(ID, status, activity_datetime) %>%
group_by(ID, status) %>%
filter(status %in% c("In Progress", "Completed")) %>%
slice(1) %>%
group_by(ID) %>%
summarize(activity = abs(diff(activity_datetime))) %>%
mutate(activity = as.numeric(activity, units = 'mins'))

ID activity 
<int> <drtn>   
1 123456   22 mins
2 234567 3172 mins

您可以将完成的时间相加,然后取差值。然后是一个分组过滤器,以获得每个ID的时间差异最大的行。

library(tidyverse)
# data:
d <- structure(list(ID = c(123456, 123456, 123456, 123456, 234567, 234567, 234567, 234567), status = c("Account Ready", "In Progress", "Additional Documentation Needed", "Completed", "In Progress", "Query Pending", "In Progress", "Completed"), datetime = structure(c(1650868200, 1650868200, 1650869280, 1650869520, 1650702300, 1650702600, 1650891600, 1650892620), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"))
d |> 
filter(status == "Completed") |> 
rename(completed = datetime) |> 
select(-status) |> 
right_join(d) |> 
mutate(diff = completed - datetime) |> 
group_by(ID) |> 
slice(which.max(diff))
#> Joining, by = "ID"
#> # A tibble: 2 × 5
#> # Groups:   ID [2]
#>       ID completed           status        datetime            diff       
#>    <dbl> <dttm>              <chr>         <dttm>              <drtn>     
#> 1 123456 2022-04-25 06:52:00 Account Ready 2022-04-25 06:30:00   1320 secs
#> 2 234567 2022-04-25 13:17:00 In Progress   2022-04-23 08:25:00 190320 secs

由reprex包(v2.0.1)于2022-04-26创建

最新更新