id  current_stage  previous_stages
1   06             05
1   06             03
2   04             03
2   04             02
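Assume there are 5 stages for an ID (02, 03, etc.). An ID should pass through every stage. In the example, ID 1 skipped stages 04 and 02, whereas ID 2 passed through all stages. So the previous stages should be the current stage - 1, - 2, and so on.
I need to identify the IDs that skipped stages. I am looking for an R or Hadoop query.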
If I have understood the problem correctly, you can try the dplyr solution below.
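library(dplyr)
df %>%
  group_by(id, current_stage) %>%
  # observed previous stages, highest first, e.g. "5,3"
  summarise(all_prev_stages = paste(sort(previous_stages, decreasing = TRUE), collapse = ","),
            .groups = "drop") %>%
  # expected previous stages: current_stage - 1 down to stage 2, built row by row
  rowwise() %>%
  mutate(possible_prev_stages = paste(seq(current_stage - 1, 2), collapse = ",")) %>%
  ungroup() %>%
  # keep only IDs whose observed stages differ from the expected ones
  filter(all_prev_stages != possible_prev_stages) %>%
  select(id)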
This gives the list of IDs that skipped stages (i.e. id = 1 in the sample data):
id
1 1
Sample data:
df <- structure(list(id = c(1L, 1L, 2L, 2L), current_stage = c(6L,
6L, 4L, 4L), previous_stages = c(5L, 3L, 3L, 2L)), .Names = c("id",
"current_stage", "previous_stages"), class = "data.frame", row.names = c(NA,
-4L))
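
If it is also useful to see which stages each ID skipped, here is a minimal base R sketch of the same idea. It assumes, as in the question, that stage 02 is the lowest stage and that each ID's previous stages are spread over several rows. The names `first_stage` and `find_skipped` are hypothetical helpers introduced only for this illustration.

```r
# Sample data from this answer, rebuilt with data.frame() so the block runs on its own
df <- data.frame(
  id              = c(1L, 1L, 2L, 2L),
  current_stage   = c(6L, 6L, 4L, 4L),
  previous_stages = c(5L, 3L, 3L, 2L)
)

first_stage <- 2L  # assumption: stage 02 is the lowest possible stage

# Hypothetical helper: stages an ID should have passed but that never appear in its rows
find_skipped <- function(d) {
  last_expected <- max(d$current_stage) - 1L
  if (last_expected < first_stage) return(integer(0))  # nothing can be skipped yet
  expected <- seq(first_stage, last_expected)
  setdiff(expected, d$previous_stages)
}

# Apply the helper per ID, then keep only IDs with at least one skipped stage
skipped <- lapply(split(df, df$id), find_skipped)
skipped <- skipped[lengths(skipped) > 0]

skipped_ids <- as.integer(names(skipped))
print(skipped)
```

For the sample data this should report stages 2 and 4 for id 1 and nothing for id 2, which matches the description in the question. `setdiff()` compares sets rather than collapsed strings, so duplicated rows or a different row order would not affect the result.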