I want to split an irregular time series into separate events and assign each event a unique numeric ID within each site.
Here is a sample data frame:
structure(list(site = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("AllenBrook", "Eastberk"), class =
"factor"),
timestamp = structure(c(10L, 13L, 8L, 4L, 5L, 6L, 7L, 9L,
11L, 12L, 1L, 2L, 3L), .Label = c("10/1/12 11:29", "10/1/12 14:29",
"10/1/12 17:29", "10/20/12 16:30", "10/20/12 19:30", "10/21/12 1:30",
"10/21/12 4:30", "9/5/12 12:30", "9/5/12 4:14", "9/5/12 6:30",
"9/5/12 7:14", "9/5/12 7:44", "9/5/12 9:30"), class = "factor")), class
= "data.frame", row.names = c(NA,
-13L))
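Events differ in length and in their number of timestamps, so I want to start a new event whenever more than 12 hours pass between a timestamp and the next timestamp for that site. Each event at a site should receive a unique numeric ID. This is the result I want: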
site timestamp eventid
1 AllenBrook 9/5/12 6:30 1
2 AllenBrook 9/5/12 9:30 1
3 AllenBrook 9/5/12 12:30 1
4 AllenBrook 10/20/12 16:30 2
5 AllenBrook 10/20/12 19:30 2
6 AllenBrook 10/21/12 1:30 2
7 AllenBrook 10/21/12 4:30 2
8 Eastberk 9/5/12 4:14 1
9 Eastberk 9/5/12 7:14 1
10 Eastberk 9/5/12 7:44 1
11 Eastberk 10/1/12 11:29 2
12 Eastberk 10/1/12 14:29 2
13 Eastberk 10/1/12 17:29 2
Any coding solution is fine, but bonus points for a tidyverse or data.table solution. Thanks for any help!
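With data.table, you could do something like this:
library(data.table)
# Parse the timestamps, then number the events within each site
setDT(tmp)[, timestamp := as.POSIXct(timestamp, format = "%m/%d/%y %H:%M")][,
    eventid := 1L + cumsum(c(0L, as.numeric(diff(timestamp), units = "hours") > 12)), by = .(site)]
diff(timestamp) computes the time difference between adjacent rows. The resulting difftime picks its units automatically from the smallest gap in each group (minutes for one site, hours for another), so comparing it directly against a number such as 720 is fragile; converting to hours explicitly and checking for more than 12 is safer. A common trick in R is to use cumsum to flag where a new event starts in a sequence and to group the following elements with it until the next event starts. Since diff returns one element fewer than its input, we pad the front with 0L. The 1L + just makes the IDs start at 1 rather than 0. This assumes the rows are already sorted by timestamp within each site, as they are in the sample data.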
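To make the diff/cumsum trick concrete, here is a minimal base R sketch on a single site. The vector `ts` and the intermediate names `gap_hours` and `new_event` are made up for illustration, and `tz = "UTC"` is an assumption chosen to avoid daylight-saving surprises.

```r
# Hypothetical timestamps for one site, already sorted
ts <- as.POSIXct(c("2012-09-05 06:30", "2012-09-05 09:30", "2012-09-05 12:30",
                   "2012-10-20 16:30", "2012-10-20 19:30"), tz = "UTC")

# Gaps between consecutive rows, forced to hours; one element shorter than ts
gap_hours <- as.numeric(diff(ts), units = "hours")

# TRUE wherever a gap of more than 12 hours starts a new event
new_event <- gap_hours > 12

# Pad the front with 0L so the lengths match, then shift the IDs to start at 1
eventid <- 1L + cumsum(c(0L, new_event))
eventid
```

The first three timestamps share one ID and the last two share the next, because only the third gap exceeds 12 hours.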
Output:
site timestamp eventid
1: AllenBrook 2012-09-05 06:30:00 1
2: AllenBrook 2012-09-05 09:30:00 1
3: AllenBrook 2012-09-05 12:30:00 1
4: AllenBrook 2012-10-20 16:30:00 2
5: AllenBrook 2012-10-20 19:30:00 2
6: AllenBrook 2012-10-21 01:30:00 2
7: AllenBrook 2012-10-21 04:30:00 2
8: Eastberk 2012-09-05 04:14:00 1
9: Eastberk 2012-09-05 07:14:00 1
10: Eastberk 2012-09-05 07:44:00 1
11: Eastberk 2012-10-01 11:29:00 2
12: Eastberk 2012-10-01 14:29:00 2
13: Eastberk 2012-10-01 17:29:00 2
Data:
tmp <- structure(list(site = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("AllenBrook", "Eastberk"), class =
"factor"),
timestamp = structure(c(10L, 13L, 8L, 4L, 5L, 6L, 7L, 9L,
11L, 12L, 1L, 2L, 3L), .Label = c("10/1/12 11:29", "10/1/12 14:29",
"10/1/12 17:29", "10/20/12 16:30", "10/20/12 19:30", "10/21/12 1:30",
"10/21/12 4:30", "9/5/12 12:30", "9/5/12 4:14", "9/5/12 6:30",
"9/5/12 7:14", "9/5/12 7:44", "9/5/12 9:30"), class = "factor")), class
= "data.frame", row.names = c(NA,
-13L))
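Since the question also mentions the tidyverse, the same idea might be sketched with dplyr roughly as follows. This is not part of the answer above; it rebuilds the sample data with plain character timestamps, and the helper column `gap_hours`, the result name `res`, and `tz = "UTC"` are assumptions made for the example.

```r
library(dplyr)

# Same sample values as above, rebuilt as plain character columns
tmp <- data.frame(
  site = rep(c("AllenBrook", "Eastberk"), times = c(7, 6)),
  timestamp = c("9/5/12 6:30", "9/5/12 9:30", "9/5/12 12:30",
                "10/20/12 16:30", "10/20/12 19:30", "10/21/12 1:30", "10/21/12 4:30",
                "9/5/12 4:14", "9/5/12 7:14", "9/5/12 7:44",
                "10/1/12 11:29", "10/1/12 14:29", "10/1/12 17:29")
)

res <- tmp %>%
  mutate(timestamp = as.POSIXct(timestamp, format = "%m/%d/%y %H:%M", tz = "UTC")) %>%
  arrange(site, timestamp) %>%   # make sure rows are ordered within each site
  group_by(site) %>%
  mutate(
    # Hours since the previous row of the same site; NA for the first row
    gap_hours = as.numeric(difftime(timestamp, dplyr::lag(timestamp), units = "hours")),
    # The first row (NA gap) or any gap above 12 hours starts a new event
    eventid = cumsum(is.na(gap_hours) | gap_hours > 12)
  ) %>%
  ungroup() %>%
  select(-gap_hours)
```

Treating the NA gap of each site's first row as the start of an event makes the IDs begin at 1 without padding, and the explicit `arrange()` removes the reliance on the input already being sorted.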