Subset a data frame based on a lag between two columns



Say you want to subset a data.frame where the rule for keeping a row depends on a lag between columns 'a' and 'b':

# input
df <- data.frame(a = c(1,0,0,0,1,0,0,0,0,0,0,0),
                 b = c(0,1,1,0,0,1,1,0,0,0,1,1))
# output
  a b
1 1 0
2 0 1
3 0 1
4 1 0
5 0 1
6 0 1

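Essentially, if 'a' = 1 I want to keep that row plus the run of rows immediately after it where 'b' is 1. The capture continues until the next row where a = 0 & b = 0. I have tried nested 'ifelse()' statements, but I get stuck combining the logical tests with the lag.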

Any suggestions?

Here is how I would do it. There may be options that need one or two fewer lines.

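df <- data.frame(a = c(1,0,0,0,1,0,0,0,0,0,0,0),
                 b = c(0,1,1,0,0,1,1,0,0,0,1,1))
library(dplyr)
df %>%
  # start a new group at every a == 1 row and at every row where a and b are both 0
  mutate(grp = cumsum(a == 1 | a + b == 0)) %>%
  group_by(grp) %>%
  # keep only the groups that contain an a == 1 row
  filter(any(a == 1)) %>%
  ungroup() %>%
  select(a, b)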
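
To see why the grouping works, it can help to look at the intermediate group index. Here is a minimal base R sketch of the same idea. The names `grp` and `keep` are only illustrative, and `%in%` stands in for the `group_by()` / `filter()` step, so this is not the exact pipeline above:

```r
df <- data.frame(a = c(1,0,0,0,1,0,0,0,0,0,0,0),
                 b = c(0,1,1,0,0,1,1,0,0,0,1,1))

# a new group starts at every row where a == 1 or where both columns are 0
grp <- cumsum(df$a == 1 | df$a + df$b == 0)
grp

# keep every row whose group contains an a == 1 row
keep <- grp %in% grp[df$a == 1]
df[keep, ]
```

Rows 11 and 12 have b == 1, but their group starts at a 0/0 row rather than at an a == 1 row, so they are dropped.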

A solution without dplyr, using a flag:

# input
df <- data.frame(a = c(1,0,0,0,1,0,0,0,0,0,0,0),
                 b = c(0,1,1,0,0,1,1,0,0,0,1,1))
# create new empty df
new_df <- read.table(text = "", col.names = c("a", "b"))
a_okay <- FALSE # initialize the flag
for (row_number in seq_len(nrow(df))) { # loop over each row of the original df
  # if a is 1, we add the row to the new df and set the flag to TRUE
  if (df[row_number, "a"] == 1) {
    a_okay <- TRUE
    new_df[nrow(new_df) + 1, ] <- c(df[row_number, "a"], df[row_number, "b"])
  } else {
    # now we consider the rows where a is not 1
    # if b is 1 and we are still following an a == 1: add the row
    if (df[row_number, "b"] == 1 && a_okay) {
      new_df[nrow(new_df) + 1, ] <- c(df[row_number, "a"], df[row_number, "b"])
    } else {
      # otherwise (b is 0, or there was no preceding a == 1): reset the flag
      a_okay <- FALSE
    }
  }
}
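
Growing a data frame one row at a time can get slow on long inputs. A possible variant of the same flag logic, sketched here as an assumption rather than as part of the answer above, records the rows to keep in a pre-allocated logical vector and subsets once at the end (`keep` and `i` are illustrative names):

```r
df <- data.frame(a = c(1,0,0,0,1,0,0,0,0,0,0,0),
                 b = c(0,1,1,0,0,1,1,0,0,0,1,1))

keep <- logical(nrow(df))   # all FALSE to start with
a_okay <- FALSE             # TRUE while we are inside a run started by a == 1

for (i in seq_len(nrow(df))) {
  if (df$a[i] == 1) {
    a_okay <- TRUE
    keep[i] <- TRUE
  } else if (df$b[i] == 1 && a_okay) {
    keep[i] <- TRUE
  } else {
    a_okay <- FALSE
  }
}

new_df <- df[keep, ]
new_df
```

Unlike the loop above, `new_df` keeps the original row numbers as row names; `rownames(new_df) <- NULL` resets them if needed.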

Another base solution, inspired by this post, @Wietse de Vries's answer and @Ben's comment.

# input
df <- data.frame(a = c(1,0,0,0,1,0,0,0,0,0,0,0),
                 b = c(0,1,1,0,0,1,1,0,0,0,1,1))
# identify groups
df$grp <- cumsum(df$a == 1 | df$b == 0)
# subset df by groups with first element of a == 1
df <- do.call(rbind, split(df, df$grp)[by(df, df$grp, function(x) {x$a[1] == 1})])
# remove grp
df$grp <- NULL
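
One thing to be aware of: binding the pieces of a `split()` back together with `rbind` does not give plain 1..n row names. If you want the row names to match the desired output, something along these lines should work. Note that `vapply()` stands in for `by()` here purely for illustration, and `res` and `starts_with_a` are made-up names:

```r
df <- data.frame(a = c(1,0,0,0,1,0,0,0,0,0,0,0),
                 b = c(0,1,1,0,0,1,1,0,0,0,1,1))

# same grouping rule as above
df$grp <- cumsum(df$a == 1 | df$b == 0)

# one TRUE/FALSE per group: does the group start with a == 1?
starts_with_a <- vapply(split(df$a, df$grp), function(x) x[1] == 1, logical(1))

# bind the selected groups back together and tidy up
res <- do.call(rbind, split(df, df$grp)[starts_with_a])
res$grp <- NULL
rownames(res) <- NULL   # reset to sequential row names
res
```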
