R: Using dplyr to select the rows before a filtered row



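I am working on a study in which we use cameras placed inside nest boxes to determine when our study species lays its first egg. Some of the cameras are not very reliable, and I want to check whether there are consecutive photos leading up to the first egg. If there are gaps, I cannot be sure that the date is the true first-egg date. There are >165,000 photos and >200 nests, so I group by nest box ID, filter to the rows with at least 1 egg, and then use slice to select the first row. Here is a reproducible example: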

example <- structure(list(boxID = c("CA10", "CA10", "CA10", "CA10", "CA10", 
"CA10", "CA10", "CA10", "CA10", "CA10", "CA10", "CA10", "CA10", 
"CA10", "CA10"), visitType = c("Image", "Image", "Image", "Image", 
"Image", "Image", "Image", "Image", "Image", "Image", "Image", 
"Image", "Image", "Image", "Image"), day = c(25L, 25L, 25L, 26L, 
26L, 26L, 27L, 27L, 27L, 28L, 28L, 28L, 29L, 29L, 29L), month = c("MAR", 
"MAR", "MAR", "MAR", "MAR", "MAR", "MAR", "MAR", "MAR", "MAR", 
"MAR", "MAR", "MAR", "MAR", "MAR"), year = c(2018, 2018, 2018, 
2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 
2018), timeChecked = c("02:59", "09:06", "15:13", "02:59", "09:07", 
"15:14", "02:59", "09:07", "15:13", "02:58", "09:06", "15:12", 
"02:58", "09:06", "15:12"), species = c("Empty", "Empty", "Empty", 
"Empty", "Empty", "Empty", "Empty", "Empty", "American Kestrel", 
"Empty", "American Kestrel", "American Kestrel", "American Kestrel", 
"American Kestrel", "American Kestrel"), sexAdult = c(NA, NA, 
NA, NA, NA, NA, NA, NA, "Female", NA, "Female", "Female", "Female", 
NA, NA), numEggs = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
"1", "1", "1", "1", "1"), numNestlings = c(NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_), date = structure(c(17615, 
17615, 17615, 17616, 17616, 17616, 17617, 17617, 17617, 17618, 
17618, 17618, 17619, 17619, 17619), class = "Date")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -15L), .Names = c("boxID", 
"visitType", "day", "month", "year", "timeChecked", "species", 
"sexAdult", "numEggs", "numNestlings", "date"))

This is the code I have for finding the first row with at least 1 egg:

library(dplyr)

example %>%
  mutate_at(vars(numEggs, numNestlings), na_if, 'unknown') %>% # remove unknowns and other values that should be NA
  select(boxID, date, numEggs, visitType) %>%
  group_by(boxID) %>%
  filter(numEggs > 0) %>% # keep only rows with at least 1 egg
  slice(1)                # first such row per nest box

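I would like to look at the 5 or 10 rows before the first row with an egg, to make sure the data are continuous up to that point. Is there a way to do this kind of row indexing with slice or another dplyr function?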

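Here is one approach. match returns the position of the first row where numEggs > 0, and we then take the n_previous rows before that position along with the row itself. We wrap the start index in max(1, ...) so that nothing goes wrong when the first numEggs > 0 falls within the first n_previous rows of a group, where the start index would otherwise be zero or negative.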

n_previous <- 5
example %>%
  mutate_at(vars(numEggs, numNestlings), na_if, 'unknown') %>%
  select(boxID, date, numEggs, visitType) %>%
  group_by(boxID) %>%
  slice(max(1, match(TRUE, numEggs > 0) - n_previous):match(TRUE, numEggs > 0))
# A tibble: 6 x 4
# Groups:   boxID [1]
boxID date       numEggs visitType
<chr> <date>     <chr>   <chr>    
1 CA10  2018-03-26 <NA>    Image    
2 CA10  2018-03-27 <NA>    Image    
3 CA10  2018-03-27 <NA>    Image    
4 CA10  2018-03-27 <NA>    Image    
5 CA10  2018-03-28 <NA>    Image    
6 CA10  2018-03-28 1       Image  
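
With more than 200 nest boxes, some groups may contain no egg rows at all, in which case match returns NA and a start:end range cannot be built. The following is a minimal sketch of one possible variant, not part of the answer above: it stores the first-egg position in a helper column and keeps rows with filter and row_number() instead of a range. The toy data, the first_egg column name, and the as.numeric() conversion of numEggs are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical toy data: box "A" gets its first egg on row 4, box "B" never has eggs
toy <- tibble(
  boxID   = c(rep("A", 6), rep("B", 3)),
  numEggs = c(NA, NA, NA, "1", "1", "2", NA, NA, NA)
)

n_previous <- 2

window <- toy %>%
  group_by(boxID) %>%
  # position of the first row with at least one egg; NA if the box has none
  mutate(first_egg = match(TRUE, as.numeric(numEggs) > 0)) %>%
  # keep the first-egg row and up to n_previous rows before it;
  # boxes without eggs are dropped because first_egg is NA
  filter(!is.na(first_egg),
         row_number() >= first_egg - n_previous,
         row_number() <= first_egg) %>%
  ungroup()

window
```

Because the comparison is on row numbers rather than a constructed range, a first egg near the top of a group simply yields fewer preceding rows, so no max(1, ...) guard is needed in this form.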

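Here is an approach that slices based on the position of the first non-missing value of numEggs. You can change the value 5 in the last line according to how many rows you want to keep before the first non-NA numEggs.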

example %>%
  mutate_at(vars(numEggs, numNestlings), na_if, 'unknown') %>%
  select(boxID, date, numEggs, visitType) %>%
  group_by(boxID) %>%
  slice((min(which(!is.na(numEggs))) - 5):min(which(!is.na(numEggs))))
# A tibble: 6 x 4
# Groups:   boxID [1]
boxID date       numEggs visitType
<chr> <date>     <chr>   <chr>    
1 CA10  2018-03-26 NA      Image    
2 CA10  2018-03-27 NA      Image    
3 CA10  2018-03-27 NA      Image    
4 CA10  2018-03-27 NA      Image    
5 CA10  2018-03-28 NA      Image    
6 CA10  2018-03-28 1       Image 
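
Once the rows leading up to the first egg have been selected, one possible way to check continuity is to measure the gap in days between consecutive photos within each box. This is only a sketch under stated assumptions: the small window data frame, the gap_days, max_gap, and continuous names, and the one-day threshold are all hypothetical and should be adapted to the real camera schedule.

```r
library(dplyr)

# Hypothetical window of rows for one box, ending at the first egg;
# note that no photo exists for 2018-03-27
window <- tibble(
  boxID = "CA10",
  date  = as.Date(c("2018-03-25", "2018-03-26", "2018-03-28", "2018-03-28"))
)

gaps <- window %>%
  group_by(boxID) %>%
  # days elapsed since the previous photo; NA for the first row of each box
  mutate(gap_days = as.numeric(difftime(date, lag(date), units = "days"))) %>%
  summarise(
    max_gap    = max(gap_days, na.rm = TRUE),
    continuous = max_gap <= 1  # assumed threshold: no calendar day without a photo
  )

gaps
```

Boxes flagged with continuous equal to FALSE would be the ones whose first-egg date deserves a manual look.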
