r-在dplyr或sql中,将第一个时间间隔的行号添加到其连续时间间隔中



我有这样的数据:

a = data.frame( id = c(1:5),
time_left = c('2010-01-04', '2010-02-15', 
'2010-03-01', '2010-03-08', 
'2010-03-15'), 
time_right = c('2010-01-28', '2010-03-02', 
'2010-03-07', '2010-03-14', 
'2010-03-23'))

我需要为它的连续时间间隔找到第一个时间间隔的id。我的主数据集太大,我需要一个有效的解决方案。

以下是输入和我想要的输出:

输入:

id  time_left time_right
1   2010-01-04 2010-01-28
2   2010-02-15 2010-03-02
3   2010-03-01 2010-03-07
4   2010-03-08 2010-03-14
5   2010-03-15 2010-03-23

输出:

id  time_left time_right   group_id
1   2010-01-04 2010-01-28   1
2   2010-02-15 2010-03-02   1
3   2010-03-01 2010-03-07   3
4   2010-03-08 2010-03-14   3
5   2010-03-15 2010-03-23   3

提前谢谢。

dplyr

library(dplyr)
a %>%
group_by(group_id = c(0L, cumsum(time_left[-1] < time_right[-n()]))) %>%
mutate(group_id = first(id)) %>%
ungroup()
# # A tibble: 5 x 4
#      id time_left  time_right group_id
#   <int> <chr>      <chr>         <int>
# 1     1 2010-01-04 2010-01-28        1
# 2     2 2010-02-15 2010-03-02        1
# 3     3 2010-03-01 2010-03-07        3
# 4     4 2010-03-08 2010-03-14        3
# 5     5 2010-03-15 2010-03-23        3

sql

sqldf::sqldf(
"with cte1 as (
select *, (case when time_left > lag(time_right) over (order by id) then 0 else 1 end) as grp
from a
),
cte2 as (
select *, sum(cte1.grp) over (order by cte1.id) as group_number
from cte1
),
cte3 as (
select cte2.id, cte2.time_left, cte2.time_right,
min(cte2.group_number) as group_number
from cte2
group by group_number
)
select cte2.id, cte2.time_left, cte2.time_right, cte3.id as group_id
from cte2
left join cte3 on cte2.group_number = cte3.group_number")