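I have a dataset with the columns ID, date, and coverage. Each ID has a different number of dates. Coverage is an integer from 0 to 3. I want to filter this dataset so that, for each ID, the earliest and the last remaining time points both have coverage == 3. Example input: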
ID date coverage
001 2012-12-24 2
001 2013-12-04 3
001 2014-12-14 1
001 2015-12-02 3
001 2016-12-02 0
002 2012-01-15 3
002 2013-11-15 1
002 2014-11-15 3
003 2019-01-15 1
003 2020-11-15 1
003 2021-11-15 3
Example output:
ID date coverage
001 2013-12-04 3
001 2014-12-14 1
001 2015-12-02 3
002 2012-01-15 3
002 2013-11-15 1
002 2014-11-15 3
003 2021-11-15 3
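We can arrange by 'ID' and 'date', group by 'ID', and slice the rows from the first coverage value of 3 to the last one. Note that if an ID has no 3 in coverage, we may need an if/else condition (with else returning NULL) to drop that ID, or, if we want to keep the full set of rows for those IDs, else can return row_number() instead.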
library(dplyr)
df1 %>%
  arrange(ID, date) %>%
  group_by(ID) %>%
  # keep rows from the first 3 to the last 3; drop IDs without any 3
  slice(if(3 %in% coverage)
    match(3, coverage):last(which(coverage == 3)) else NULL) %>%
  # if we want to keep the full rows
  # slice(if(3 %in% coverage)
  #   match(3, coverage):last(which(coverage == 3)) else row_number()) %>%
  ungroup()
-output
# A tibble: 7 × 3
ID date coverage
<int> <date> <int>
1 1 2013-12-04 3
2 1 2014-12-14 1
3 1 2015-12-02 3
4 2 2012-01-15 3
5 2 2013-11-15 1
6 2 2014-11-15 3
7 3 2021-11-15 3
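As an alternative sketch (not part of the slice() approach above), the same window can probably be expressed with filter() and two cumulative sums: one running forward to mark rows at or after the first 3, and one running backward to mark rows at or before the last 3. IDs with no 3 at all drop out without an explicit if/else. The object name `res` and the inline reconstruction of `df1` are assumptions made for illustration.

```r
library(dplyr)

# Example data rebuilt inline so the snippet runs on its own
df1 <- data.frame(
  ID = rep(1:3, times = c(5, 3, 3)),
  date = as.Date(c("2012-12-24", "2013-12-04", "2014-12-14", "2015-12-02",
                   "2016-12-02", "2012-01-15", "2013-11-15", "2014-11-15",
                   "2019-01-15", "2020-11-15", "2021-11-15")),
  coverage = c(2L, 3L, 1L, 3L, 0L, 3L, 1L, 3L, 1L, 1L, 3L)
)

res <- df1 %>%
  arrange(ID, date) %>%
  group_by(ID) %>%
  filter(
    cumsum(coverage == 3) > 0,            # at or after the first 3
    rev(cumsum(rev(coverage == 3))) > 0   # at or before the last 3
  ) %>%
  ungroup()

res
```

With the example data this should keep the same seven rows as the slice() version.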
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L), date = structure(c(15698, 16043, 16418, 16771, 17137, 15354,
16024, 16389, 17911, 18581, 18946), class = "Date"), coverage = c(2L,
3L, 1L, 3L, 0L, 3L, 1L, 3L, 1L, 1L, 3L)), row.names = c(NA, -11L
), class = "data.frame")
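The example data has a 3 for every ID, so the "no 3" case mentioned earlier is never exercised. The following is a minimal base R sketch on a hypothetical toy data set (the names `toy` and `toy_kept`, the dates, and ID 4 are made up for illustration) in which one ID never reaches coverage 3. It uses ave() with cumulative sums rather than slice(), so it is a different technique from the dplyr answer, but the intent is the same.

```r
# Hypothetical toy data: ID 4 never has coverage == 3
toy <- data.frame(
  ID = c(1L, 1L, 1L, 4L, 4L),
  date = as.Date(c("2020-01-01", "2021-01-01", "2022-01-01",
                   "2020-06-01", "2021-06-01")),
  coverage = c(3L, 0L, 3L, 1L, 2L)
)

# Sort by ID and date so "first" and "last" refer to time order
toy <- toy[order(toy$ID, toy$date), ]

is3 <- as.integer(toy$coverage == 3)

# Per ID: TRUE from the first 3 onward
seen_first <- ave(is3, toy$ID, FUN = cumsum) > 0
# Per ID: TRUE up to and including the last 3
before_last <- ave(is3, toy$ID, FUN = function(x) rev(cumsum(rev(x)))) > 0

toy_kept <- toy[seen_first & before_last, ]
toy_kept
```

Here all rows of ID 4 are dropped because neither cumulative sum ever becomes positive for that group, which corresponds to the else NULL branch of the slice() version.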