我有这样的数据:
a = data.frame( id = c(1:5),
time_left = c('2010-01-04', '2010-02-15',
'2010-03-01', '2010-03-08',
'2010-03-15'),
time_right = c('2010-01-28', '2010-03-02',
'2010-03-07', '2010-03-14',
'2010-03-23'))
我需要为它的连续时间间隔找到第一个时间间隔的id。我的主数据集太大,我需要一个有效的解决方案。
以下是输入和我想要的输出:
输入:
id time_left time_right
1 2010-01-04 2010-01-28
2 2010-02-15 2010-03-02
3 2010-03-01 2010-03-07
4 2010-03-08 2010-03-14
5 2010-03-15 2010-03-23
输出:
id time_left time_right group_id
1 2010-01-04 2010-01-28 1
2 2010-02-15 2010-03-02 1
3 2010-03-01 2010-03-07 3
4 2010-03-08 2010-03-14 3
5 2010-03-15 2010-03-23 3
提前谢谢。
dplyr
library(dplyr)
a %>%
group_by(group_id = c(0L, cumsum(time_left[-1] < time_right[-n()]))) %>%
mutate(group_id = first(id)) %>%
ungroup()
# # A tibble: 5 x 4
# id time_left time_right group_id
# <int> <chr> <chr> <int>
# 1 1 2010-01-04 2010-01-28 1
# 2 2 2010-02-15 2010-03-02 1
# 3 3 2010-03-01 2010-03-07 3
# 4 4 2010-03-08 2010-03-14 3
# 5 5 2010-03-15 2010-03-23 3
sql
sqldf::sqldf(
"with cte1 as (
select *, (case when time_left > lag(time_right) over (order by id) then 0 else 1 end) as grp
from a
),
cte2 as (
select *, sum(cte1.grp) over (order by cte1.id) as group_number
from cte1
),
cte3 as (
select cte2.id, cte2.time_left, cte2.time_right,
min(cte2.group_number) as group_number
from cte2
group by group_number
)
select cte2.id, cte2.time_left, cte2.time_right, cte3.id as group_id
from cte2
left join cte3 on cte2.group_number = cte3.group_number")