r按id和id内的多个状态更改列出的最小-最大日期



我有一个动物跟踪数据集,如下所示

Id         Start       Stop          Status
78122      10/12/1919  10/12/1919    Birth
78122      1/18/1966   2/2/1972      In
78122      2/3/1972    9/8/1972      In
78122      9/9/1972    1/23/1974     In
78122      1/24/1974   10/22/1975    Out
78122      10/23/1975  5/4/1979      Out
78122      5/5/1979    8/29/1980     Out
78122      8/30/1980   5/14/1988     Out
78122      5/15/1988   6/18/1988     In
78122      6/19/1988   1/12/1989     In
78122      1/13/1989   2/23/1990     In
78122      2/24/1990   6/15/1991     Out
78122      6/16/1991   2/11/1993     Out
78122      2/12/1993   5/3/1994      Out
78122      5/4/1994    7/27/1994     In
78122      7/22/1994   1/25/1996     Out
78122      1/26/1996   11/13/2001    In
78122      11/14/2001  11/19/2001    In
78122      11/20/2001  9/1/2009      In
78122      9/26/2009   9/26/2009     Death

这种动物出生于1919年,但多次进出其本土。我想要创建的是这样一个数据集。我想按状态总结min(Start)max(Stop)的日期。

例如:有三行表示动物在1/18/19661/23/1974之间的区域内。

Id         Start       Stop          Status
78122      1/18/1966   2/2/1972      In
78122      2/3/1972    9/8/1972      In
78122      9/9/1972    1/23/1974     In

该信息应汇总为1行,其中min(Start)max(Stop)与类似

Id         MinStart    MaxStop       Status
78122      1/18/1966   1/23/1974     In

再次,有四行指示动物在1/24/19745/14/1988之间的区域之外。

Id         Start       Stop          Status
78122      1/24/1974   10/22/1975    Out
78122      10/23/1975  5/4/1979      Out
78122      5/5/1979    8/29/1980     Out
78122      8/30/1980   5/14/1988     Out

该信息应汇总为1行,其中min(Start)max(Stop)与类似

Id         MinStart    MaxStop       Status
78122      1/24/1974   5/14/1988     Out

与其他输入和输出状态类似。最终的数据集应该如下所示。

Id         MinStart    MaxStop       Status
78122      10/12/1919  10/12/1919    Birth
78122      1/18/1966   1/23/1974     In
78122      1/24/1974   5/14/1988     Out
78122      5/15/1988   2/23/1990     In
78122      2/24/1990   5/3/1994      Out
78122      5/4/1994    7/27/1994     In
78122      7/28/1994   1/25/1996     Out
78122      1/26/1996   9/1/2009      In
78122      9/26/2009   9/26/2009     Death

任何关于如何根据上述标准重新排列数据集的建议都是事先准备好的。到目前为止,我尝试了

test1 <- testcase %>% 
group_by(ID,Status) %>% 
summarize(MinStart  = min(Start), MaxStop= max(Stop))

但这似乎不起作用。它只为所有输入状态和输出状态创建一个分钟和停止日期。这是不正确的。

您需要一些游程长度编码。为了方便起见,我将使用data.table::rleid,但如果你想的话,你可以使用基本版本:

library(data.table)
testcase %>% 
group_by(Id, RLE = rleid(Status)) %>%
arrange(Start) %>%
dplyr::summarise(Start = min(Start), Stop = max(Stop), Status = first(Status))
# A tibble: 9 x 5
# Groups:   Id [1]
Id   RLE Start      Stop       Status
<int> <int> <date>     <date>     <chr> 
1 78122     1 1919-10-12 1919-10-12 Birth 
2 78122     2 1966-01-18 1974-01-23 In    
3 78122     3 1974-01-24 1988-05-14 Out   
4 78122     4 1988-05-15 1990-02-23 In    
5 78122     5 1990-02-24 1994-05-03 Out   
6 78122     6 1994-05-04 1994-07-27 In    
7 78122     7 1994-07-22 1996-01-25 Out   
8 78122     8 1996-01-26 2009-09-01 In    
9 78122     9 2009-09-26 2009-09-26 Death 

请注意,我已将您的日期转换为date类,我将留给您。否则它们不会正确排序。

这是没有data.tablegroup_by呼叫

...
group_by(Id, RLE = with(rle(Status), rep(seq_along(lengths), lengths))) %>%
...

实现这一点的一种方法是捕获日期,同时使用sapply将其强制转换为数值,以便以后能够使用range。然后,在ave中,我们在mapply中使用rle,使变量x在每次Status更改时增长1。我们现在可以很容易地aggregaterangeIdx上,其中列子集已经给了我们结果,我们只需要将as.Datecbind(x的后缀(转换为gsub

d[2:3] <- sapply(d[2:3], function(x) as.Date(x, "%m/%d/%Y"))
f <- function(x) {r <- rle(x)$l;unlist(mapply(rep, seq(r), r))}
d <- transform(d, x=paste(Id, ave(Status, Id, FUN=f), Status))
r <- do.call(data.frame, aggregate(cbind(Start, Stop) ~ Id + x, d, FUN=range))[c(1:3, 6)]
r[3:4] <- lapply(r[3:4], as.Date, origin="1970-01-01")
r <- cbind(r[1], setNames(r[3:4], c("MinStart", "MaxStop")), Status=gsub(".*\s", "", r$x))

结果

r[order(r$Id), ]
#       Id   MinStart    MaxStop Status
# 1  78122 1919-10-12 1919-10-12  Birth
# 2  78122 1966-01-18 1974-01-23     In
# 3  78122 1974-01-24 1988-05-14    Out
# 4  78122 1988-05-15 1990-02-23     In
# 5  78122 1990-02-24 1994-05-03    Out
# 6  78122 1994-05-04 1994-07-27     In
# 7  78122 1994-07-22 1996-01-25    Out
# 8  78122 1996-01-26 2009-09-01     In
# 9  78122 2009-09-26 2009-09-26  Death
# 10 78123 1919-10-12 1919-10-12  Birth
# 11 78123 1966-01-18 1974-01-23     In
# 12 78123 1974-01-24 1988-05-14    Out
# 13 78123 1988-05-15 1990-02-23     In
# 14 78123 1990-02-24 1994-05-03    Out
# 15 78123 1994-05-04 1994-07-27     In
# 16 78123 1994-07-22 1996-01-25    Out
# 17 78123 1996-01-26 2009-09-01     In
# 18 78123 2009-09-26 2009-09-26  Death

数据:

注意:数据帧加倍,Id增加一以进行演示。

d <- structure(list(Id = c(78122L, 78122L, 78122L, 78122L, 78122L, 
78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 
78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 78123L, 
78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 
78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 
78123L, 78123L, 78123L), Start = c("10/12/1919", "1/18/1966", 
"2/3/1972", "9/9/1972", "1/24/1974", "10/23/1975", "5/5/1979", 
"8/30/1980", "5/15/1988", "6/19/1988", "1/13/1989", "2/24/1990", 
"6/16/1991", "2/12/1993", "5/4/1994", "7/22/1994", "1/26/1996", 
"11/14/2001", "11/20/2001", "9/26/2009", "10/12/1919", "1/18/1966", 
"2/3/1972", "9/9/1972", "1/24/1974", "10/23/1975", "5/5/1979", 
"8/30/1980", "5/15/1988", "6/19/1988", "1/13/1989", "2/24/1990", 
"6/16/1991", "2/12/1993", "5/4/1994", "7/22/1994", "1/26/1996", 
"11/14/2001", "11/20/2001", "9/26/2009"), Stop = c("10/12/1919", 
"2/2/1972", "9/8/1972", "1/23/1974", "10/22/1975", "5/4/1979", 
"8/29/1980", "5/14/1988", "6/18/1988", "1/12/1989", "2/23/1990", 
"6/15/1991", "2/11/1993", "5/3/1994", "7/27/1994", "1/25/1996", 
"11/13/2001", "11/19/2001", "9/1/2009", "9/26/2009", "10/12/1919", 
"2/2/1972", "9/8/1972", "1/23/1974", "10/22/1975", "5/4/1979", 
"8/29/1980", "5/14/1988", "6/18/1988", "1/12/1989", "2/23/1990", 
"6/15/1991", "2/11/1993", "5/3/1994", "7/27/1994", "1/25/1996", 
"11/13/2001", "11/19/2001", "9/1/2009", "9/26/2009"), Status = c("Birth", 
"In", "In", "In", "Out", "Out", "Out", "Out", "In", "In", "In", 
"Out", "Out", "Out", "In", "Out", "In", "In", "In", "Death", 
"Birth", "In", "In", "In", "Out", "Out", "Out", "Out", "In", 
"In", "In", "Out", "Out", "Out", "In", "Out", "In", "In", "In", 
"Death")), class = "data.frame", row.names = c(NA, -40L))

例如,您可以使用insurancerating::reduce():

library(insurancerating)
library(dplyr)
library(lubridate)
d %>% 
mutate(across(c(Start, Stop), lubridate::mdy)) %>%
insurancerating::reduce(d_date, begin = Start, end = Stop, Id, Status)
Id Status index      Start       Stop
# 1  78122  Birth     0 1919-10-12 1919-10-12
# 2  78122  Death     0 2009-09-26 2009-09-26
# 3  78122     In     0 1966-01-18 1974-01-23
# 4  78122     In     1 1988-05-15 1990-02-23
# 5  78122     In     2 1994-05-04 1994-07-27
# 6  78122     In     3 1996-01-26 2009-09-01
# 7  78122    Out     0 1974-01-24 1988-05-14
# 8  78122    Out     1 1990-02-24 1994-05-03
# 9  78122    Out     2 1994-07-22 1996-01-25
# 10 78123  Birth     0 1919-10-12 1919-10-12
# 11 78123  Death     0 2009-09-26 2009-09-26
# 12 78123     In     0 1966-01-18 1974-01-23
# 13 78123     In     1 1988-05-15 1990-02-23
# 14 78123     In     2 1994-05-04 1994-07-27
# 15 78123     In     3 1996-01-26 2009-09-01
# 16 78123    Out     0 1974-01-24 1988-05-14
# 17 78123    Out     1 1990-02-24 1994-05-03
# 18 78123    Out     2 1994-07-22 1996-01-25

注:d为@jay.sf 提供的数据

相关内容

  • 没有找到相关文章