背景:
我有一个数据集df,我想在其中遵循关于时间戳的特定模式。我想先做
1. Identify the 'Connect' value timestamp
2. Check the action that follows, and check to see if the next action
is an 'Ended' or 'Attempt' with a less than or equal to 60 second gap
3. If this <= gap of 60 second is present, I wish for the code to Skip these timestamps
and keep iterating until it comes to the next 'Ended' value, and to record this value.
输出模式应始终遵循"连接"one_answers"结束">
We start with:
Connect 4/6/2020 1:11:41 PM
Then look to the next line:
Ended 4/6/2020 1:14:20 PM
Now look to the line that follows:
Attempt 4/6/2020 1:15:20 PM
These two timestamps are less than or equal to 60 seconds, so we keep going
until we come across an Ended value where these conditions do not apply.
So the Ended value of
Ended 4/6/2020 2:05:18 PM gets recorded.
Action Time
Connect 4/6/2020 1:11:41 PM
Ended 4/6/2020 1:14:20 PM
Attempt 4/6/2020 1:15:20 PM
Connect 4/6/2020 1:15:21 PM
Ended 4/6/2020 2:05:18 PM
Connect 3/31/2020 11:00:08 AM
Ended 3/31/2020 11:14:54 AM
Ended 3/31/2020 4:17:43 PM
正如我们在下面看到的,这些行已经从下午1:14:20和下午1:15:20被删除,它们之间的时间间隔不到60秒2020年3月31日下午4:17:43不是我们遇到的下一个即时"结束"值。
Ended 4/6/2020 1:14:20 PM
Attempt 4/6/2020 1:15:20 PM
Connect 4/6/2020 1:15:21 PM
Ended 3/31/2020 4:17:43 PM
所需输出:
Action Time
Connect 4/6/2020 1:11:41 PM
Ended 4/6/2020 2:05:18 PM
Connect 3/31/2020 11:00:08 AM
Ended 3/31/2020 11:14:54 AM
输出模式应始终遵循"连接"one_answers"结束">
Dput:
structure(list(Action = structure(c(2L, 3L, 1L, 2L, 3L, 2L, 3L,
3L), .Label = c("Attempt", "Connect", "Ended"), class = "factor"),
Time = structure(c(4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L), .Label = c("3/31/2020 11:00:08 AM",
"3/31/2020 11:14:54 AM", "3/31/2020 4:17:43 PM", "4/6/2020 1:11:41 PM",
"4/6/2020 1:14:20 PM", "4/6/2020 1:15:20 PM", "4/6/2020 1:15:21 PM",
"4/6/2020 2:05:18 PM"), class = "factor")), class = "data.frame", row.names = c(NA,
-8L))
这是我尝试过的:
我想我应该使用一个循环,但不太确定如何构建它。感谢您的帮助。
library(lubridate)
if (value <= 60) {
print("")
} else {
Expr2
}
我们可以将'Time'从lubridate
转换为Datetime
类和mdy_hms
,根据'Action'中出现的'Connect'创建一个分组变量lee,得到'Time'元素中的差异('Diff'(,filter
去掉差异小于或等于60的行,然后filter
去掉'Action'的duplicated
类似元素行
library(dplyr)
library(lubridate)
library(data.table)
df1 %>%
mutate(Time1 = mdy_hms(Time)) %>%
group_by(grp = cumsum(Action == 'Connect')) %>%
mutate(Diff = difftime(Time1, lag(Time1), unit = 'sec'),
Diff = case_when(any(Diff <=60) ~ 60, TRUE ~ as.numeric(Diff))) %>%
filter(Action == 'Connect'|Diff >60) %>%
ungroup %>%
filter(!duplicated(rleid(Action))) %>%
select(Action, Time)
# A tibble: 4 x 2
# Action Time
# <fct> <fct>
#1 Connect 4/6/2020 1:11:41 PM
#2 Ended 4/6/2020 2:05:18 PM
#3 Connect 3/31/2020 11:00:08 AM
#4 Ended 3/31/2020 11:14:54 AM
这里有一个使用dplyr
、data.table
和lubridate
的方法。
首先,我们计算数据集中经过的累积时间。接下来,我们使用cumsum
将数据集分解为间隔大于60秒的连接尝试。然后,我们按连接尝试进行分组,只有在第一次连接尝试后超过60秒时才保留非连接事件。然后借用@akrun的方法,过滤重复的连续操作。
library(lubridate)
library(dplyr)
library(data.table)
df %>%
mutate(Time = mdy_hms(Time)) %>%
dplyr::arrange(Time) %>%
mutate(CumTime = cumsum(time_length(Time - dplyr::lag(Time, 1L,default = as.integer(min(mdy_hms(df$Time))))))) %>%
group_by(Action) %>%
mutate(LastConnect = if_else(Action == "Connect", time_length(CumTime - dplyr::lag(CumTime, 1L, 0)), 0)) %>%
ungroup %>%
mutate(ConnectionInterval = cumsum(Action == "Connect" & LastConnect > 60)) %>%
dplyr::select(-LastConnect) %>%
group_by(ConnectionInterval) %>%
mutate(ConnectCumTime = time_length(Time - dplyr::lag(Time, 1L))) %>%
filter(Action == "Connect" | ConnectCumTime > 60 & !duplicated(rleid(Action)))
## A tibble: 6 x 5
## Groups: ConnectionInterval [3]
# Action Time CumTime ConnectionInterval ConnectCumTime
# <fct> <dttm> <dbl> <int> <dbl>
#1 Connect 2020-03-31 11:00:08 0 0 NA
#2 Ended 2020-03-31 11:14:54 886 0 886
#3 Connect 2020-04-06 13:11:41 526293 1 NA
#4 Ended 2020-04-06 13:14:20 526452 1 159
#5 Connect 2020-04-06 13:15:21 526513 2 NA
#6 Ended 2020-04-06 14:05:18 529510 2 2997