I have a data frame (df) like this:
marker_id timestamp
A 2020-10-01 07:32:14
A 2020-10-01 07:34:09
A 2020-10-01 10:13:00
A 2020-10-05 06:55:04
A 2020-10-05 06:59:14
A 2020-10-09 17:44:02
B 2020-10-01 13:58:47
B 2020-10-02 04:11:38
B 2020-10-02 04:15:07
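I would like to know whether there is a way to assign a group to each run of timestamps within each marker_id. A run should contain consecutive rows whose timestamps are less than 1 hour apart.
The desired result looks like this: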
marker_id timestamp group_id
A 2020-10-01 07:32:14 1
A 2020-10-01 07:34:09 1
A 2020-10-01 10:13:00 2
A 2020-10-05 06:55:04 3
A 2020-10-05 06:59:14 3
A 2020-10-09 17:44:02 4
B 2020-10-01 13:58:47 5
B 2020-10-02 04:11:38 6
B 2020-10-02 04:15:07 6
Is there any way to do this? Thanks for any suggestions.
Here is some dummy data.
library(tidyverse)
df <- tribble(
  ~marker_id, ~timestamp,
  'A', '2020-10-01 07:32:14',
  'A', '2020-10-01 07:34:09',
  'A', '2020-10-01 10:13:00',
  'A', '2020-10-05 06:55:04',
  'A', '2020-10-05 06:59:14',
  'A', '2020-10-09 17:44:02',
  'B', '2020-10-01 13:58:47',
  'B', '2020-10-02 04:11:38',
  'B', '2020-10-02 04:15:07'
) %>%
  mutate(timestamp = lubridate::as_datetime(timestamp))
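Use lag to get the difference between each row and the previous one, and convert it to hours. group_id can then be computed by checking whether that difference is >= 1 and taking the cumsum of the result.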
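df %>%
  mutate(
    # A gap of at least 1 hour to the previous row starts a new group.
    # The first row is compared with min(timestamp), so it has no NA gap;
    # the trailing + 1 makes the ids start at 1.
    group_id = cumsum(
      abs(as.numeric(timestamp - lag(timestamp, default = min(timestamp)), units = "hours")) >= 1
    ) + 1
  )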
#> # A tibble: 9 x 3
#> marker_id timestamp group_id
#> <chr> <dttm> <dbl>
#> 1 A 2020-10-01 07:32:14 1
#> 2 A 2020-10-01 07:34:09 1
#> 3 A 2020-10-01 10:13:00 2
#> 4 A 2020-10-05 06:55:04 3
#> 5 A 2020-10-05 06:59:14 3
#> 6 A 2020-10-09 17:44:02 4
#> 7 B 2020-10-01 13:58:47 5
#> 8 B 2020-10-02 04:11:38 6
#> 9 B 2020-10-02 04:15:07 6
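The code above does not group by marker_id. On this data it still separates A from B, because the jump back in time at the boundary is more than an hour in absolute value. If the first timestamp of one marker happened to fall within an hour of the last timestamp of the previous one, the two would share a group_id. The following is a minimal sketch of a variant that computes the gap within each marker_id, assuming the rows should be ordered by marker_id and timestamp. The helper columns gap_hours and new_group are names made up for this illustration, and as.POSIXct with tz = "UTC" stands in for lubridate::as_datetime.

```r
library(dplyr)

# Same dummy data as in the question, parsed as UTC date-times.
df <- tibble::tribble(
  ~marker_id, ~timestamp,
  'A', '2020-10-01 07:32:14',
  'A', '2020-10-01 07:34:09',
  'A', '2020-10-01 10:13:00',
  'A', '2020-10-05 06:55:04',
  'A', '2020-10-05 06:59:14',
  'A', '2020-10-09 17:44:02',
  'B', '2020-10-01 13:58:47',
  'B', '2020-10-02 04:11:38',
  'B', '2020-10-02 04:15:07'
) %>%
  mutate(timestamp = as.POSIXct(timestamp, tz = "UTC"))

result <- df %>%
  arrange(marker_id, timestamp) %>%
  group_by(marker_id) %>%
  mutate(
    # Gap to the previous row of the same marker, in hours (NA for the first row).
    gap_hours = as.numeric(difftime(timestamp, lag(timestamp), units = "hours")),
    # The first row of each marker, or a gap of at least 1 hour, starts a new group.
    new_group = is.na(gap_hours) | gap_hours >= 1
  ) %>%
  ungroup() %>%
  # Running count of group starts across the whole table gives a global id.
  mutate(group_id = cumsum(new_group)) %>%
  select(marker_id, timestamp, group_id)

print(result)
```

Because the gap is computed inside group_by(marker_id), the first row of every marker always opens a new group, so abs is no longer needed. Taking the cumsum after ungroup() keeps the ids running across markers, as in the desired output.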