r-使用循环从日期时间序列中删除复杂模式



背景:

我有一个数据集df,我想在其中遵循关于时间戳的特定模式。我想先做

1. Identify the 'Connect' value timestamp
2. Check the action that follows, and check to see if the next action
is an 'Ended' or 'Attempt' with a less than or equal to 60 second gap
3. If this <= gap of 60 second is present, I wish for the code to Skip these timestamps
and keep iterating until it comes to the next 'Ended' value, and to record this value.

输出模式应始终遵循"连接"one_answers"结束">

We start with:
Connect            4/6/2020 1:11:41 PM
Then look to the next line:
Ended              4/6/2020 1:14:20 PM
Now look to the line that follows:
Attempt            4/6/2020 1:15:20 PM


These two timestamps are less than or equal to 60 seconds, so we keep going    
until we come across an Ended value where these conditions do not apply. 
So the Ended value of 
Ended              4/6/2020 2:05:18 PM    gets recorded.




Action             Time
Connect            4/6/2020 1:11:41 PM
Ended              4/6/2020 1:14:20 PM
Attempt            4/6/2020 1:15:20 PM
Connect            4/6/2020 1:15:21 PM
Ended              4/6/2020 2:05:18 PM
Connect            3/31/2020 11:00:08 AM
Ended              3/31/2020 11:14:54 AM
Ended              3/31/2020 4:17:43 PM

正如我们在下面看到的,这些行已经从下午1:14:20和下午1:15:20被删除,它们之间的时间间隔不到60秒2020年3月31日下午4:17:43不是我们遇到的下一个即时"结束"值。

Ended              4/6/2020 1:14:20 PM
Attempt            4/6/2020 1:15:20 PM
Connect            4/6/2020 1:15:21 PM
Ended              3/31/2020 4:17:43 PM

所需输出:

Action              Time

Connect             4/6/2020 1:11:41 PM        
Ended               4/6/2020 2:05:18 PM
Connect             3/31/2020 11:00:08 AM
Ended               3/31/2020 11:14:54 AM

输出模式应始终遵循"连接"one_answers"结束">

Dput:

structure(list(Action = structure(c(2L, 3L, 1L, 2L, 3L, 2L, 3L, 
3L), .Label = c("Attempt", "Connect", "Ended"), class =     "factor"), 
Time = structure(c(4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L), .Label =      c("3/31/2020 11:00:08 AM", 
"3/31/2020 11:14:54 AM", "3/31/2020 4:17:43 PM", "4/6/2020      1:11:41 PM", 
"4/6/2020 1:14:20 PM", "4/6/2020 1:15:20 PM", "4/6/2020  1:15:21   PM", 
"4/6/2020 2:05:18 PM"), class = "factor")), class =     "data.frame", row.names = c(NA, 
-8L))

这是我尝试过的:

我想我应该使用一个循环,但不太确定如何构建它。感谢您的帮助。

library(lubridate)
if (value <= 60) {
print("") 
} else {
Expr2
}

我们可以将'Time'从lubridate转换为Datetime类和mdy_hms,根据'Action'中出现的'Connect'创建一个分组变量lee,得到'Time'元素中的差异('Diff'(,filter去掉差异小于或等于60的行,然后filter去掉'Action'的duplicated类似元素行

library(dplyr)
library(lubridate)
library(data.table)
df1 %>%
mutate(Time1 = mdy_hms(Time)) %>%
group_by(grp = cumsum(Action == 'Connect')) %>% 
mutate(Diff = difftime(Time1, lag(Time1), unit = 'sec'),
Diff = case_when(any(Diff <=60) ~ 60, TRUE ~ as.numeric(Diff))) %>%
filter(Action == 'Connect'|Diff >60) %>%
ungroup %>% 
filter(!duplicated(rleid(Action))) %>% 
select(Action, Time)
# A tibble: 4 x 2
#  Action  Time                    
#  <fct>   <fct>                   
#1 Connect 4/6/2020      1:11:41 PM
#2 Ended   4/6/2020 2:05:18 PM     
#3 Connect 3/31/2020 11:00:08 AM   
#4 Ended   3/31/2020 11:14:54 AM   

这里有一个使用dplyrdata.tablelubridate的方法。

首先,我们计算数据集中经过的累积时间。接下来,我们使用cumsum将数据集分解为间隔大于60秒的连接尝试。然后,我们按连接尝试进行分组,只有在第一次连接尝试后超过60秒时才保留非连接事件。然后借用@akrun的方法,过滤重复的连续操作。

library(lubridate)
library(dplyr)
library(data.table)
df %>% 
mutate(Time = mdy_hms(Time)) %>%
dplyr::arrange(Time) %>%
mutate(CumTime = cumsum(time_length(Time - dplyr::lag(Time, 1L,default = as.integer(min(mdy_hms(df$Time))))))) %>%
group_by(Action) %>%
mutate(LastConnect = if_else(Action == "Connect", time_length(CumTime - dplyr::lag(CumTime, 1L, 0)), 0)) %>%
ungroup %>%
mutate(ConnectionInterval = cumsum(Action == "Connect" & LastConnect > 60)) %>%
dplyr::select(-LastConnect) %>%
group_by(ConnectionInterval) %>%
mutate(ConnectCumTime = time_length(Time - dplyr::lag(Time, 1L))) %>% 
filter(Action == "Connect" | ConnectCumTime > 60 & !duplicated(rleid(Action)))
## A tibble: 6 x 5
## Groups:   ConnectionInterval [3]
#  Action  Time                CumTime ConnectionInterval ConnectCumTime
#  <fct>   <dttm>                <dbl>              <int>          <dbl>
#1 Connect 2020-03-31 11:00:08       0                  0             NA
#2 Ended   2020-03-31 11:14:54     886                  0            886
#3 Connect 2020-04-06 13:11:41  526293                  1             NA
#4 Ended   2020-04-06 13:14:20  526452                  1            159
#5 Connect 2020-04-06 13:15:21  526513                  2             NA
#6 Ended   2020-04-06 14:05:18  529510                  2           2997

最新更新