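This question is similar to the one here: R collapse by year by ID
However, I would like to collapse the time periods by ID and State, provided the gap between consecutive periods is no more than 31 days. If the gap is more than 31 days, the periods are not collapsed and a new row starts instead. For example, if this is my dataset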
ID From To State
1 2004-04-05 2005-02-05 MD
1 2005-03-05 2005-03-05 MD
1 2005-04-05 2005-10-05 DC
1 2006-03-05 2006-10-05 DC
1 2006-11-05 2007-03-05 DC
1 2007-04-05 2007-06-05 MD
1 2008-03-05 2008-11-05 MD
1 2008-12-05 2010-08-05 MD
1 2010-09-05 2012-11-05 MD
2 2003-05-05 2004-08-05 OR
2 2004-09-05 2009-03-05 OR
2 2010-06-05 2010-08-05 AZ
2 2013-06-05 2015-06-05 AZ
The final dataset after collapsing would look like this
ID From To State
1 2004-04-05 2005-03-05 MD
1 2005-04-05 2005-10-05 DC
1 2006-03-05 2007-03-05 DC
1 2007-04-05 2007-06-05 MD
1 2008-03-05 2012-11-05 MD
2 2003-05-05 2009-03-05 OR
2 2010-06-05 2010-08-05 AZ
2 2013-06-05 2015-06-05 AZ
Any suggestions would be appreciated.
Test case 2:
ID From To State
1 2003-09-05 2003-11-05 MD
1 2004-09-05 2007-05-05 TX
1 2007-06-05 2007-07-05 DC
1 2007-08-05 2009-07-05 DC
1 2011-11-05 2014-03-05 MD
1 2014-05-05 2017-06-05 MD
Expected result
ID From To State
1 2003-09-05 2003-11-05 MD
1 2004-09-05 2007-05-05 TX
1 2007-06-05 2009-07-05 DC
1 2011-11-05 2017-06-05 MD
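Subtract the previous To date from the current From date, create a new grouping column that increments whenever that gap exceeds 31 days, and within each group select the first From value and the last To value.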
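library(dplyr)

df %>%
  # Convert both date columns to Date so that subtraction yields a gap in days
  mutate(across(c(From, To), as.Date)) %>%
  # Start a new group whenever the gap since the previous row's To exceeds 31 days
  group_by(ID, State,
           group = cumsum(From - dplyr::lag(To, default = as.Date('1970-01-01')) > 31)) %>%
  # Keep the first From and the last To of each collapsed run
  summarise(From = first(From),
            To = last(To), .groups = 'drop') %>%
  select(-group) %>%
  arrange(ID, From)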
# ID State From To
# <int> <chr> <date> <date>
#1 1 MD 2004-04-05 2005-03-05
#2 1 DC 2005-04-05 2005-10-05
#3 1 DC 2006-03-05 2007-03-05
#4 1 MD 2007-04-05 2007-06-05
#5 1 MD 2008-03-05 2012-11-05
#6 2 OR 2003-05-05 2009-03-05
#7 2 AZ 2010-06-05 2010-08-05
#8 2 AZ 2013-06-05 2015-06-05
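
To see how the grouping column behaves before the rows are summarised, the intermediate values can be inspected on the first dataset from the question. This is only a sketch: the names steps and gap_days are illustrative and not part of the answer, and it assumes the rows are already sorted by ID and From, as they are in the question.

```r
library(dplyr)

# First dataset from the question, read in as character columns
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
ID From To State
1 2004-04-05 2005-02-05 MD
1 2005-03-05 2005-03-05 MD
1 2005-04-05 2005-10-05 DC
1 2006-03-05 2006-10-05 DC
1 2006-11-05 2007-03-05 DC
1 2007-04-05 2007-06-05 MD
1 2008-03-05 2008-11-05 MD
1 2008-12-05 2010-08-05 MD
1 2010-09-05 2012-11-05 MD
2 2003-05-05 2004-08-05 OR
2 2004-09-05 2009-03-05 OR
2 2010-06-05 2010-08-05 AZ
2 2013-06-05 2015-06-05 AZ
")

# gap_days and steps are hypothetical names used only for this illustration
steps <- df %>%
  mutate(across(c(From, To), as.Date),
         # Days between this row's From and the previous row's To
         gap_days = as.numeric(From - lag(To, default = as.Date("1970-01-01"))),
         # Counter that increases each time the gap exceeds 31 days
         group = cumsum(gap_days > 31))

steps$group
# 1 1 1 2 2 2 3 3 3 3 3 4 5
```

The counter runs over the whole data frame rather than within each ID, so on its own it does not separate IDs or states. Grouping by ID, State and group together is what produces the eight collapsed rows.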