I have a data frame like this:
timestamp Status
05-01-2020 12:07:08 0
05-01-2020 12:36:05 1
05-01-2020 23:45:02 0
05-01-2020 13:44:33 1
06-01-2020 01:07:08 1
06-01-2020 10:23:05 1
06-01-2020 12:11:08 1
06-01-2020 22:06:12 1
07-01-2020 00:01:05 0
07-01-2020 02:17:09 1
07-01-2020 12:36:05 1
07-01-2020 12:07:08 1
07-01-2020 12:36:05 1
07-01-2020 12:36:05 0
08-01-2020 12:36:05 1
08-01-2020 12:36:05 0
08-01-2020 12:36:05 0
09-01-2020 12:36:05 1
09-01-2020 12:07:08 0
09-01-2020 12:36:05 1
11-01-2020 12:07:08 0
11-01-2020 12:36:05 1
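I am trying to find the duration between each 1, 0 pair. However, the statuses in my data can appear in any order: 1s and 0s may alternate one by one, or many 1s may be followed by a single 0, and so on. The first occurrence of 1 acts as the start and the first occurrence of 0 after it acts as the end, whether there are no further 1s in between or any number of them. If the start (1) occurs on one day and the end (0) on the next day, I want to split the duration in two, provided the dates are consecutive (such as 1, 2, 3, 4).
I can already handle the straightforward case where the 1 and the 0 fall on the same date. For a pair that spans two dates, I can also compute the difference between the 1 and 23:59:59 on the first day, and likewise from 00:00:00 to the 0 on the second day.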
For example, suppose I have data like this:
07-01-2020 21:26:05 1
08-01-2020 02:33:45 0
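These two rows fall on two different dates. So instead of taking the difference directly, I want to split it in two: on the first day (07-01-2020) the duration runs from 21:26:05 to 23:59:59, and on the second day it runs from 00:00:00 to 02:33:45. This should be repeated for any number of consecutive dates (such as the 7th, 8th, 9th, 10th, and so on).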
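The midnight split described above can be sketched in base R as follows. This is only a minimal sketch: `split_by_day` is a hypothetical helper name, the timestamps are assumed to be already parsed as POSIXct in UTC, and the question's dd-mm-yyyy dates are written in ISO format.

```r
# Hypothetical helper: split the interval [start, end] into one piece per
# calendar day, cutting at 23:59:59 / 00:00:00 as described in the question.
split_by_day <- function(start, end, tz = "UTC") {
  days <- seq(as.Date(start, tz = tz), as.Date(end, tz = tz), by = "day")
  from <- as.POSIXct(paste(days, "00:00:00"), tz = tz)
  to   <- as.POSIXct(paste(days, "23:59:59"), tz = tz)
  from[1] <- start          # the first piece begins at the real start
  to[length(to)] <- end     # the last piece stops at the real end
  data.frame(from = from, to = to)
}

start <- as.POSIXct("2020-01-07 21:26:05", tz = "UTC")
end   <- as.POSIXct("2020-01-08 02:33:45", tz = "UTC")

pieces <- split_by_day(start, end)
pieces$mins <- as.numeric(difftime(pieces$to, pieces$from, units = "mins"))
print(pieces)
```

For this pair the two pieces come out at 153.9 and 153.75 minutes. Because the helper walks over every date between start and end, the same call should also cover a pair that spans three or more consecutive days.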
But if I have data like this:
07-01-2020 21:26:05 1
08-01-2020 02:33:45 1
09-01-2020 21:26:05 1
11-01-2020 02:33:45 1
I have to cut it like this (the 9th is followed by the 11th, so the continuity is broken):
07-01-2020 21:26:05 to 07-01-2020 23:59:59
08-01-2020 00:00:00 to 08-01-2020 02:33:45
08-01-2020 02:33:45 to 08-01-2020 23:59:59
09-01-2020 00:00:00 to 09-01-2020 21:26:05
09-01-2020 21:26:05 to 09-01-2020 23:59:59
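One possible way to detect where the continuity breaks, as a rough sketch: compare each date with the previous one and start a new block whenever more than one day has passed. The names `gap` and `block` are only illustrative, and the dates are written in ISO format.

```r
# Dates of the four rows in the example (07, 08, 09 and 11 January 2020).
dates <- as.Date(c("2020-01-07", "2020-01-08", "2020-01-09", "2020-01-11"))

# Gap in days to the previous row; the first row has no predecessor.
gap <- c(0, as.numeric(diff(dates)))

# A new block starts whenever the gap is larger than one day.
block <- cumsum(gap > 1) + 1

print(data.frame(dates, gap, block))
```

Here the first three rows share one block and the row on the 11th starts a new one, so any splitting would then be done within each block separately.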
A case like this:
07-01-2020 21:26:05 1
07-01-2020 22:33:45 1
07-01-2020 23:31:51 1
07-01-2020 23:48:33 0
07-01-2020 23:57:12 0
should be treated as:
07-01-2020 21:26:05 1
07-01-2020 23:48:33 0
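The "first 1 is the start, first 0 after it is the end" rule can be sketched in base R like this. It is a minimal sketch assuming the rows are already sorted by time and the timestamps are POSIXct; `runs`, `pairs` and `mins` are illustrative names only.

```r
x <- data.frame(
  timestamp = as.POSIXct(c("2020-01-07 21:26:05", "2020-01-07 22:33:45",
                           "2020-01-07 23:31:51", "2020-01-07 23:48:33",
                           "2020-01-07 23:57:12"), tz = "UTC"),
  Status = c(1, 1, 1, 0, 0)
)

# Keep only the first row of each run of identical Status values.
runs <- x[c(TRUE, diff(x$Status) != 0), ]

# A start is a 1 whose next remaining row is a 0.
i <- which(runs$Status == 1 & c(runs$Status[-1], NA) == 0)

pairs <- data.frame(from = runs$timestamp[i], to = runs$timestamp[i + 1])
pairs$mins <- as.numeric(difftime(pairs$to, pairs$from, units = "mins"))
print(pairs)
```

A leading 0 with no 1 before it, or a trailing 1 with no 0 after it, simply produces no pair in this sketch.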
And a case like this:
07-01-2020 21:26:05 1
07-01-2020 22:33:45 1
07-01-2020 23:31:51 1
08-01-2020 03:48:33 0
08-01-2020 03:57:12 0
should become:
07-01-2020 21:26:05 to 07-01-2020 23:59:59
08-01-2020 00:00:00 to 08-01-2020 03:48:33
I tried an ifelse condition in data.table and was able to do the first split, from x to 23:59:59 on the first day. But none of the other conditions worked.
df[, difference := ifelse(
  (df$Status == 0 & shift(df$Status, type = 'lag') == 1) & (as.Date(df$timestamp) != shift(as.Date(df$timestamp), type = 'lag')),
  as.numeric(df$timestamp - as.POSIXct(paste0(as.Date(timestamp), " ", "00:00:00"), tz = "UTC"), units = 'mins'),
  ifelse((df$Status == 1 & shift(df$Status, type = 'lead') == 0) & as.Date(df$timestamp) != shift(as.Date(df$timestamp), type = 'lead'),
    as.numeric(as.POSIXct(paste0(as.Date(timestamp), " ", "23:59:59"), tz = "UTC") - df$timestamp, units = 'mins'),
    as.numeric(shift(df$timestamp, type = 'lead') - df$timestamp, units = 'mins')))]
Let
A = data.frame(timestamp = as.POSIXct(c("2020-07-01 21:26:05",
                                        "2020-07-02 02:33:45",
                                        "2020-07-02 10:33:45",
                                        "2020-07-03 15:33:45",
                                        "2020-07-04 02:33:45")),
               ind = c(0, 1, 1, 0, 1))
> A
timestamp ind
1 2020-07-01 21:26:05 0
2 2020-07-02 02:33:45 1
3 2020-07-02 10:33:45 1
4 2020-07-03 15:33:45 0
5 2020-07-04 02:33:45 1
be toy data for this example. The following code then gives you the time distance between the first occurrences of consecutive 0s and 1s.
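library(dplyr)

A %>%
  mutate(Diff = ind - lag(ind)) %>%    # change in ind relative to the previous row
  filter(is.na(Diff) | Diff != 0) %>%  # keep the first row and every row where ind changes
  mutate(Timedist = timestamp - lag(timestamp)) %>%
  select(-Diff)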
with output:
timestamp ind Timedist
1 2020-07-01 21:26:05 0 NA hours
2 2020-07-02 02:33:45 1 5.1 hours
3 2020-07-03 15:33:45 0 37.0 hours
4 2020-07-04 02:33:45 1 11.0 hours
library(tidyverse)
# Non-daily split:
df %>%
  mutate(grp = cumsum(ifelse(ind == 0, 1, 0))) %>%
  group_by(grp) %>%
  filter(!(duplicated(ind))) %>%
  ungroup() %>%
  mutate(duration = difftime(timestamp, lag(timestamp), units = "hours"))
# Daily split:
df %>%
  group_by(grp1 = as.Date(timestamp, "%Y-%m-%d")) %>%
  filter(!duplicated(ind)) %>%
  ungroup() %>%
  mutate(grp = cumsum(ifelse(ind == 0, 1, 0))) %>%
  group_by(grp, grp1) %>%
  mutate(duration = difftime(timestamp, lag(timestamp), units = "hours")) %>%
  ungroup()