R - Find the duration between a start flag and an end flag in a data frame column, splitting it across dates when it spans more than one day



I have a data frame like this:

timestamp           Status
05-01-2020  12:07:08    0
05-01-2020  12:36:05    1
05-01-2020  23:45:02    0
05-01-2020  13:44:33    1
06-01-2020  01:07:08    1
06-01-2020  10:23:05    1
06-01-2020  12:11:08    1
06-01-2020  22:06:12    1
07-01-2020  00:01:05    0
07-01-2020  02:17:09    1
07-01-2020  12:36:05    1
07-01-2020  12:07:08    1
07-01-2020  12:36:05    1
07-01-2020  12:36:05    0
08-01-2020  12:36:05    1
08-01-2020  12:36:05    0
08-01-2020  12:36:05    0
09-01-2020  12:36:05    1
09-01-2020  12:07:08    0
09-01-2020  12:36:05    1
11-01-2020  12:07:08    0
11-01-2020  12:36:05    1

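I am trying to find the duration between each 1, 0 pair. But in my data the statuses can appear in any order: 1 and 0 may alternate one by one, or there may be many 1s followed by a single 0, and so on. If the start (1) occurs on one day and the end (0) occurs on the next day, I want to split the duration in two, provided the dates are consecutive (like 1, 2, 3, 4). This applies whether there are no other 1s in between or any number of 1s between the 1 and the 0. The first occurrence of 1 is the start, and the first occurrence of 0 after it is the end.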

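I am able to calculate the difference in the straightforward case where the 1 and the 0 fall on the same date. Also, when they fall on two dates, I can calculate the difference between the occurrence of the 1 and 23:59:59 on the first day, and likewise from 00:00:00 to the occurrence of the 0 on the second day.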

For example, suppose I have a set of data like this:

07-01-2020  21:26:05    1
08-01-2020  02:33:45    0

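These two fall on two different dates. So instead of taking the difference directly, it should be split in two: on the first day (07-01-2020) my duration runs from 21:26:05 to 23:59:59, and on the second day it runs from 00:00:00 to 02:33:45. This should be repeated for any number of consecutive dates (like 7, 8, 9, 10, etc.).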
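
The midnight split described above can be sketched in base R. This is only an illustration under stated assumptions: the timestamps are parsed as day-month-year in UTC, a day is taken to end at 23:59:59 (the convention used in the question, so the last second of each day is not counted), and `parse_ts` and `split_by_day` are hypothetical helper names that do not come from the original post.

```r
# Hypothetical helper: parse "dd-mm-YYYY HH:MM:SS" strings as UTC timestamps.
parse_ts <- function(x) as.POSIXct(x, format = "%d-%m-%Y %H:%M:%S", tz = "UTC")

# Hypothetical helper: cut one start/end interval into per-day pieces.
# Each day is assumed to run from 00:00:00 to 23:59:59, as in the question.
split_by_day <- function(start, end, tz = "UTC") {
  stopifnot(end >= start)
  days <- seq(as.Date(start, tz = tz), as.Date(end, tz = tz), by = "day")
  from <- as.POSIXct(paste(days, "00:00:00"), tz = tz)
  to   <- as.POSIXct(paste(days, "23:59:59"), tz = tz)
  from[1] <- start            # the first piece begins at the actual start
  to[length(to)] <- end       # the last piece stops at the actual end
  data.frame(from = from, to = to,
             mins = as.numeric(difftime(to, from, units = "mins")))
}

pieces <- split_by_day(parse_ts("07-01-2020 21:26:05"),
                       parse_ts("08-01-2020 02:33:45"))
print(pieces)
```

Because the function builds one row per calendar day between the start and the end, the same call should also cover intervals that span three or more consecutive dates.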

But if there is data like this:

07-01-2020  21:26:05    1
08-01-2020  02:33:45    1
09-01-2020  21:26:05    1
11-01-2020  02:33:45    1

I have to cut it like this (since the 9th is followed by the 11th, the continuity is broken):

07-01-2020  21:26:05 to  07-01-2020  23:59:59
08-01-2020  00:00:00 to  08-01-2020  02:33:45
08-01-2020  02:33:45 to  08-01-2020  23:59:59
09-01-2020  00:00:00 to  09-01-2020  21:26:05
09-01-2020  21:26:05 to  09-01-2020  23:59:59
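
One possible way to detect where the run of consecutive dates breaks, as in the cut above, is to compare neighbouring calendar dates. The following is a minimal base R sketch, assuming the timestamps are day-month-year in UTC; `parse_ts` and `block` are illustrative names that are not part of the original post.

```r
# Hypothetical helper: parse "dd-mm-YYYY HH:MM:SS" strings as UTC timestamps.
parse_ts <- function(x) as.POSIXct(x, format = "%d-%m-%Y %H:%M:%S", tz = "UTC")

ts <- parse_ts(c("07-01-2020 21:26:05", "08-01-2020 02:33:45",
                 "09-01-2020 21:26:05", "11-01-2020 02:33:45"))

dates <- as.Date(ts, tz = "UTC")

# A new block starts whenever the gap to the previous date exceeds one day.
block <- cumsum(c(TRUE, as.numeric(diff(dates)) > 1))
print(data.frame(timestamp = ts, block = block))
```

Rows that share a `block` value lie on consecutive dates, so any splitting logic could be applied within each block separately.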

A condition like this:

07-01-2020  21:26:05    1
07-01-2020  22:33:45    1
07-01-2020  23:31:51    1
07-01-2020  23:48:33    0
07-01-2020  23:57:12    0

should be treated as:

07-01-2020  21:26:05    1
07-01-2020  23:48:33    0
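
Reducing repeated flags to the first 1 and the first 0, as in the example above, can be sketched by keeping only the rows where `Status` changes. This is a base R illustration, not the asker's code. It assumes the rows should first be sorted by time (the sample data at the top is not fully ordered), and `parse_ts`, `is_change` and `collapsed` are hypothetical names.

```r
# Hypothetical helper: parse "dd-mm-YYYY HH:MM:SS" strings as UTC timestamps.
parse_ts <- function(x) as.POSIXct(x, format = "%d-%m-%Y %H:%M:%S", tz = "UTC")

df <- data.frame(
  timestamp = parse_ts(c("07-01-2020 21:26:05", "07-01-2020 22:33:45",
                         "07-01-2020 23:31:51", "07-01-2020 23:48:33",
                         "07-01-2020 23:57:12")),
  Status = c(1, 1, 1, 0, 0)
)

# Assumption: events are processed in chronological order.
df <- df[order(df$timestamp), ]

# Keep the first row and every row whose Status differs from the previous row.
is_change <- c(TRUE, diff(df$Status) != 0)
collapsed <- df[is_change, ]
print(collapsed)
```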

And a condition like this:

07-01-2020  21:26:05    1
07-01-2020  22:33:45    1
07-01-2020  23:31:51    1
08-01-2020  03:48:33    0
08-01-2020  03:57:12    0

should give:

07-01-2020  21:26:05   to  07-01-2020  23:59:59
08-01-2020  00:00:00   to  08-01-2020  03:48:33
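
Putting the two ideas together for this case, a rough base R sketch might collapse repeated flags, pair each remaining 1 with the 0 that follows it, and split every pair at midnight. It rests on several assumptions: timestamps are day-month-year in UTC, rows are sorted by time, and a day ends at 23:59:59 as in the question. It does not check for gaps between dates. `parse_ts`, `split_by_day`, `runs` and `pieces` are hypothetical names.

```r
# Hypothetical helper: parse "dd-mm-YYYY HH:MM:SS" strings as UTC timestamps.
parse_ts <- function(x) as.POSIXct(x, format = "%d-%m-%Y %H:%M:%S", tz = "UTC")

# Hypothetical helper: cut one interval into per-day pieces (00:00:00 to 23:59:59).
split_by_day <- function(start, end, tz = "UTC") {
  days <- seq(as.Date(start, tz = tz), as.Date(end, tz = tz), by = "day")
  from <- as.POSIXct(paste(days, "00:00:00"), tz = tz)
  to   <- as.POSIXct(paste(days, "23:59:59"), tz = tz)
  from[1] <- start
  to[length(to)] <- end
  data.frame(from = from, to = to,
             mins = as.numeric(difftime(to, from, units = "mins")))
}

df <- data.frame(
  timestamp = parse_ts(c("07-01-2020 21:26:05", "07-01-2020 22:33:45",
                         "07-01-2020 23:31:51", "08-01-2020 03:48:33",
                         "08-01-2020 03:57:12")),
  Status = c(1, 1, 1, 0, 0)
)

# Sort, then keep only the rows where Status changes (first 1, first 0, ...).
df <- df[order(df$timestamp), ]
runs <- df[c(TRUE, diff(df$Status) != 0), ]

# After collapsing, statuses alternate, so each 1 is closed by the next row.
starts <- which(runs$Status == 1)
starts <- starts[starts < nrow(runs)]   # a trailing 1 has no closing 0

pieces <- do.call(rbind, lapply(starts, function(i) {
  split_by_day(runs$timestamp[i], runs$timestamp[i + 1])
}))
print(pieces)
```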

I tried an ifelse condition in data.table and was able to get the first split, from x to 23:59:59 on the first day. But none of the other conditions work.

df[, difference := ifelse(
  (df$Status == 0 & shift(df$Status, type = 'lag') == 1) &
    (as.Date(df$timestamp) != shift(as.Date(df$timestamp), type = 'lag')),
  as.numeric(df$timestamp - as.POSIXct(paste0(as.Date(timestamp), " ", "00:00:00"), tz = "UTC"), units = 'mins'),
  ifelse(
    (df$Status == 1 & shift(df$Status, type = 'lead') == 0) &
      as.Date(df$timestamp) != shift(as.Date(df$timestamp), type = 'lead'),
    as.numeric(as.POSIXct(paste0(as.Date(timestamp), " ", "23:59:59"), tz = "UTC") - df$timestamp, units = 'mins'),
    as.numeric(shift(df$timestamp, type = 'lead') - df$timestamp, units = 'mins')
  )
)]

A = data.frame(
  timestamp = as.POSIXct(c("2020-07-01 21:26:05",
                           "2020-07-02 02:33:45",
                           "2020-07-02 10:33:45",
                           "2020-07-03 15:33:45",
                           "2020-07-04 02:33:45"), tz = "UTC"),
  ind = as.numeric(c(0, 1, 1, 0, 1))
)
> A
timestamp ind
1 2020-07-01 21:26:05   0
2 2020-07-02 02:33:45   1
3 2020-07-02 10:33:45   1
4 2020-07-03 15:33:45   0
5 2020-07-04 02:33:45   1

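The above is toy data for this example. The following code then gives you the time distance between the first occurrences of consecutive 0s and 1s.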

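library(dplyr)

A %>%
  mutate(Diff = ind - lag(ind)) %>%        # change in flag relative to the previous row
  filter(is.na(Diff) | Diff != 0) %>%      # keep the first row and every row where the flag changes
  mutate(Timedist = timestamp - lag(timestamp)) %>%
  select(-Diff)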

With output:

timestamp ind   Timedist
1 2020-07-01 21:26:05   0    NA hours
2 2020-07-02 02:33:45   1   5.1 hours
3 2020-07-03 15:33:45   0  37.0 hours
4 2020-07-04 02:33:45   1  11.0 hours

library(tidyverse)

# Non-daily split:
df %>%
  mutate(grp = cumsum(ifelse(ind == 0, 1, 0))) %>%
  group_by(grp) %>%
  filter(!(duplicated(ind))) %>%
  ungroup() %>%
  mutate(duration = difftime(timestamp, lag(timestamp), units = "hours"))

# Daily split:
df %>%
  group_by(grp1 = as.Date(timestamp, "%Y-%m-%d")) %>%
  filter(!duplicated(ind)) %>%
  ungroup() %>%
  mutate(grp = cumsum(ifelse(ind == 0, 1, 0))) %>%
  group_by(grp, grp1) %>%
  mutate(duration = difftime(timestamp, lag(timestamp), units = "hours")) %>%
  ungroup()
