R - How to create a sequence column from the start and end of sequences



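I have two columns that contain information about the start and end of sequences. I want to create a sequence column from them, such that each sequence starts where seq_start = 1 and ends at the first row with seq_end = 1 that occurs at or after that seq_start = 1. How can I do this with the tidyverse? The data looks like this, where seq is the expected output. Note that when seq_end = 1 and seq_start = 1 are in the same row, this yields a sequence of length one.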

d <- structure(list(seq_start = c(NA, NA, NA, NA, NA, 1, NA, NA, NA, 
NA, NA, 1, NA, 1, NA, NA, NA, NA, NA, NA, 1, 1, NA, NA, NA, NA, 
NA, 1, 1, NA, NA, 1, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, 1, 
NA), seq_end = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 
1L, 1L, 1L, NA, NA, 1L, 1L, 1L, NA, 1L, NA, NA, NA, NA, NA, 1L, 
1L, NA, NA, 1L, 1L, NA, 1L, 1L, 1L, 1L, NA, NA, NA, 1L, 1L, NA, 
NA, NA, NA, NA, NA, 1L, NA, 1L, 1L, NA, 1L, 1L, NA, NA, 1L, 1L, 
1L), seq = c(NA, NA, NA, NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
NA, 3L, NA, NA, NA, NA, NA, NA, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 
7L, 7L, 7L, 8L, NA, NA, NA, 9L, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, 10L, 10L, NA, NA, NA, NA, NA, NA, NA, 11L, 
NA)), .Names = c("seq_start", "seq_end", "seq"), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -60L))

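Here is a solution that relies heavily on lag() from the dplyr package and cumsum() from base R to produce the expected result. It may not be the most concise solution, but I do think it is fairly intuitive to follow: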

library(dplyr)

# d is assumed to hold the data frame from the question.
d <- d %>%
  # new.seq.starts starts from 0, and increments by 1 every time seq_start takes on
  # the value 1, like this: 0, 0, 0, 1, 1, 1, 1, 2, 2, ...
  # Rows with the same new.seq.starts value are thus part of the same "run".
  mutate(new.seq.starts = cumsum(!is.na(seq_start))) %>%
  # group by each "run"
  group_by(new.seq.starts) %>%
  # any.ending.so.far counts how many rows with seq_end == 1 have occurred within the
  # run so far (0 means no ending yet).
  # first.ending is TRUE only if it's the first row (within the run) to have an ending.
  mutate(any.ending.so.far = cumsum(!is.na(seq_end)),
         first.ending = any.ending.so.far == 1 &
           (is.na(lag(any.ending.so.far)) | lag(any.ending.so.far) < 1)) %>%
  ungroup() %>%
  # result keeps the new.seq.starts values only if there's no ending yet (i.e.
  # any.ending.so.far == 0), or only just ended (first.ending == TRUE). Otherwise,
  # it takes on the value NA.
  mutate(result = ifelse(new.seq.starts > 0 &
                           (any.ending.so.far == 0 | first.ending),
                         new.seq.starts, NA)) %>%
  # Remove helper variables as they are no longer needed.
  select(-c(new.seq.starts, any.ending.so.far, first.ending))

> all.equal(d$seq, d$result)
[1] TRUE
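
To see the run-labelling idea in isolation, here is a small sketch on made-up data. It is a compact variant of the approach above, not the answer's exact code: the toy data frame and the helper columns run and ends_before are hypothetical names introduced only for illustration. Instead of tracking the first ending explicitly, it keeps a row when no ending has occurred strictly before it within its run, which should select the same rows.

```r
library(dplyr)

# Hypothetical toy data (not the data from the question).
toy <- tibble(
  seq_start = c(NA, 1, NA, NA, NA, 1, NA),
  seq_end   = c(NA, NA, 1, NA, 1, 1, NA)
)

out <- toy %>%
  # run is 0 before the first start and increases by 1 at every seq_start == 1.
  mutate(run = cumsum(!is.na(seq_start))) %>%
  group_by(run) %>%
  # ends_before counts the endings seen strictly before the current row in its run.
  mutate(ends_before = lag(cumsum(!is.na(seq_end)), default = 0L)) %>%
  ungroup() %>%
  # Keep the run number up to and including the first ending, NA afterwards.
  mutate(result = ifelse(run > 0 & ends_before == 0, run, NA))

print(out$result)
# NA 1 1 NA NA 2 NA
```

The second run starts and ends in the same row, so it has length one, matching the special case noted in the question.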
