我有一个动物跟踪数据集,如下所示
Id Start Stop Status
78122 10/12/1919 10/12/1919 Birth
78122 1/18/1966 2/2/1972 In
78122 2/3/1972 9/8/1972 In
78122 9/9/1972 1/23/1974 In
78122 1/24/1974 10/22/1975 Out
78122 10/23/1975 5/4/1979 Out
78122 5/5/1979 8/29/1980 Out
78122 8/30/1980 5/14/1988 Out
78122 5/15/1988 6/18/1988 In
78122 6/19/1988 1/12/1989 In
78122 1/13/1989 2/23/1990 In
78122 2/24/1990 6/15/1991 Out
78122 6/16/1991 2/11/1993 Out
78122 2/12/1993 5/3/1994 Out
78122 5/4/1994 7/27/1994 In
78122 7/22/1994 1/25/1996 Out
78122 1/26/1996 11/13/2001 In
78122 11/14/2001 11/19/2001 In
78122 11/20/2001 9/1/2009 In
78122 9/26/2009 9/26/2009 Death
这种动物出生于1919年,但多次进出其本土。我想要创建的是这样一个数据集。我想按状态总结min(Start)
和max(Stop)
的日期。
例如:有三行表示动物在1/18/1966
到1/23/1974
之间的区域内。
Id Start Stop Status
78122 1/18/1966 2/2/1972 In
78122 2/3/1972 9/8/1972 In
78122 9/9/1972 1/23/1974 In
该信息应汇总为1行,其中min(Start)
和max(Stop)
与类似
Id MinStart MaxStop Status
78122 1/18/1966 1/23/1974 In
再次,有四行指示动物在1/24/1974
到5/14/1988
之间的区域之外。
Id Start Stop Status
78122 1/24/1974 10/22/1975 Out
78122 10/23/1975 5/4/1979 Out
78122 5/5/1979 8/29/1980 Out
78122 8/30/1980 5/14/1988 Out
该信息应汇总为1行,其中min(Start)
和max(Stop)
与类似
Id MinStart MaxStop Status
78122 1/24/1974 5/14/1988 Out
与其他输入和输出状态类似。最终的数据集应该如下所示。
Id MinStart MaxStop Status
78122 10/12/1919 10/12/1919 Birth
78122 1/18/1966 1/23/1974 In
78122 1/24/1974 5/14/1988 Out
78122 5/15/1988 2/23/1990 In
78122 2/24/1990 5/3/1994 Out
78122 5/4/1994 7/27/1994 In
78122 7/28/1994 1/25/1996 Out
78122 1/26/1996 9/1/2009 In
78122 9/26/2009 9/26/2009 Death
任何关于如何根据上述标准重新排列数据集的建议都是事先准备好的。到目前为止,我尝试了
test1 <- testcase %>%
group_by(ID,Status) %>%
summarize(MinStart = min(Start), MaxStop= max(Stop))
但这似乎不起作用。它只为所有输入状态和输出状态创建一个分钟和停止日期。这是不正确的。
您需要一些游程长度编码。为了方便起见,我将使用data.table::rleid
,但如果你想的话,你可以使用基本版本:
library(data.table)
testcase %>%
group_by(Id, RLE = rleid(Status)) %>%
arrange(Start) %>%
dplyr::summarise(Start = min(Start), Stop = max(Stop), Status = first(Status))
# A tibble: 9 x 5
# Groups: Id [1]
Id RLE Start Stop Status
<int> <int> <date> <date> <chr>
1 78122 1 1919-10-12 1919-10-12 Birth
2 78122 2 1966-01-18 1974-01-23 In
3 78122 3 1974-01-24 1988-05-14 Out
4 78122 4 1988-05-15 1990-02-23 In
5 78122 5 1990-02-24 1994-05-03 Out
6 78122 6 1994-05-04 1994-07-27 In
7 78122 7 1994-07-22 1996-01-25 Out
8 78122 8 1996-01-26 2009-09-01 In
9 78122 9 2009-09-26 2009-09-26 Death
请注意,我已将您的日期转换为date
类,我将留给您。否则它们不会正确排序。
这是没有data.table
的group_by
呼叫
...
group_by(Id, RLE = with(rle(Status), rep(seq_along(lengths), lengths))) %>%
...
实现这一点的一种方法是捕获日期,同时使用sapply
将其强制转换为数值,以便以后能够使用range
。然后,在ave
中,我们在mapply
中使用rle
,使变量x在每次Status更改时增长1。我们现在可以很容易地aggregate
,range
在Id和x上,其中列子集已经给了我们结果,我们只需要将as.Date
和cbind
(x的后缀(转换为gsub
。
d[2:3] <- sapply(d[2:3], function(x) as.Date(x, "%m/%d/%Y"))
f <- function(x) {r <- rle(x)$l;unlist(mapply(rep, seq(r), r))}
d <- transform(d, x=paste(Id, ave(Status, Id, FUN=f), Status))
r <- do.call(data.frame, aggregate(cbind(Start, Stop) ~ Id + x, d, FUN=range))[c(1:3, 6)]
r[3:4] <- lapply(r[3:4], as.Date, origin="1970-01-01")
r <- cbind(r[1], setNames(r[3:4], c("MinStart", "MaxStop")), Status=gsub(".*\s", "", r$x))
结果
r[order(r$Id), ]
# Id MinStart MaxStop Status
# 1 78122 1919-10-12 1919-10-12 Birth
# 2 78122 1966-01-18 1974-01-23 In
# 3 78122 1974-01-24 1988-05-14 Out
# 4 78122 1988-05-15 1990-02-23 In
# 5 78122 1990-02-24 1994-05-03 Out
# 6 78122 1994-05-04 1994-07-27 In
# 7 78122 1994-07-22 1996-01-25 Out
# 8 78122 1996-01-26 2009-09-01 In
# 9 78122 2009-09-26 2009-09-26 Death
# 10 78123 1919-10-12 1919-10-12 Birth
# 11 78123 1966-01-18 1974-01-23 In
# 12 78123 1974-01-24 1988-05-14 Out
# 13 78123 1988-05-15 1990-02-23 In
# 14 78123 1990-02-24 1994-05-03 Out
# 15 78123 1994-05-04 1994-07-27 In
# 16 78123 1994-07-22 1996-01-25 Out
# 17 78123 1996-01-26 2009-09-01 In
# 18 78123 2009-09-26 2009-09-26 Death
数据:
注意:数据帧加倍,Id增加一以进行演示。
d <- structure(list(Id = c(78122L, 78122L, 78122L, 78122L, 78122L,
78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 78122L,
78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 78123L,
78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 78123L,
78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 78123L,
78123L, 78123L, 78123L), Start = c("10/12/1919", "1/18/1966",
"2/3/1972", "9/9/1972", "1/24/1974", "10/23/1975", "5/5/1979",
"8/30/1980", "5/15/1988", "6/19/1988", "1/13/1989", "2/24/1990",
"6/16/1991", "2/12/1993", "5/4/1994", "7/22/1994", "1/26/1996",
"11/14/2001", "11/20/2001", "9/26/2009", "10/12/1919", "1/18/1966",
"2/3/1972", "9/9/1972", "1/24/1974", "10/23/1975", "5/5/1979",
"8/30/1980", "5/15/1988", "6/19/1988", "1/13/1989", "2/24/1990",
"6/16/1991", "2/12/1993", "5/4/1994", "7/22/1994", "1/26/1996",
"11/14/2001", "11/20/2001", "9/26/2009"), Stop = c("10/12/1919",
"2/2/1972", "9/8/1972", "1/23/1974", "10/22/1975", "5/4/1979",
"8/29/1980", "5/14/1988", "6/18/1988", "1/12/1989", "2/23/1990",
"6/15/1991", "2/11/1993", "5/3/1994", "7/27/1994", "1/25/1996",
"11/13/2001", "11/19/2001", "9/1/2009", "9/26/2009", "10/12/1919",
"2/2/1972", "9/8/1972", "1/23/1974", "10/22/1975", "5/4/1979",
"8/29/1980", "5/14/1988", "6/18/1988", "1/12/1989", "2/23/1990",
"6/15/1991", "2/11/1993", "5/3/1994", "7/27/1994", "1/25/1996",
"11/13/2001", "11/19/2001", "9/1/2009", "9/26/2009"), Status = c("Birth",
"In", "In", "In", "Out", "Out", "Out", "Out", "In", "In", "In",
"Out", "Out", "Out", "In", "Out", "In", "In", "In", "Death",
"Birth", "In", "In", "In", "Out", "Out", "Out", "Out", "In",
"In", "In", "Out", "Out", "Out", "In", "Out", "In", "In", "In",
"Death")), class = "data.frame", row.names = c(NA, -40L))
例如,您可以使用insurancerating::reduce()
:
library(insurancerating)
library(dplyr)
library(lubridate)
d %>%
mutate(across(c(Start, Stop), lubridate::mdy)) %>%
insurancerating::reduce(d_date, begin = Start, end = Stop, Id, Status)
Id Status index Start Stop
# 1 78122 Birth 0 1919-10-12 1919-10-12
# 2 78122 Death 0 2009-09-26 2009-09-26
# 3 78122 In 0 1966-01-18 1974-01-23
# 4 78122 In 1 1988-05-15 1990-02-23
# 5 78122 In 2 1994-05-04 1994-07-27
# 6 78122 In 3 1996-01-26 2009-09-01
# 7 78122 Out 0 1974-01-24 1988-05-14
# 8 78122 Out 1 1990-02-24 1994-05-03
# 9 78122 Out 2 1994-07-22 1996-01-25
# 10 78123 Birth 0 1919-10-12 1919-10-12
# 11 78123 Death 0 2009-09-26 2009-09-26
# 12 78123 In 0 1966-01-18 1974-01-23
# 13 78123 In 1 1988-05-15 1990-02-23
# 14 78123 In 2 1994-05-04 1994-07-27
# 15 78123 In 3 1996-01-26 2009-09-01
# 16 78123 Out 0 1974-01-24 1988-05-14
# 17 78123 Out 1 1990-02-24 1994-05-03
# 18 78123 Out 2 1994-07-22 1996-01-25
注:d
为@jay.sf 提供的数据