R: Grouping services within a range of rows in a data frame into numbered orders



I have a data frame of services. Now I need to add a column "Order" and group the rows according to the following rule:

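Group services into orders: if, within the next 5 values after a service value "A", there is another service "A", fill all of those rows with the same order ID, including the rows that have no service value. If there is no service value within the next 5 values, the next service starts a new order group.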

dput(data)

structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 
14, 15, 16), time = structure(1:15, .Label = c("13:20:01", "13:20:02", 
"13:20:03", "13:20:04", "13:20:05", "13:20:06", "13:20:07", "13:20:08", 
"13:20:09", "13:20:10", "13:20:11", "13:20:12", "13:20:13", "13:20:14", 
"13:20:15"), class = "factor"), apples = c(2, 2, 2, 3, 3, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2), service = structure(c(NA, 1L, 1L, 
NA, 1L, NA, 1L, NA, NA, NA, NA, NA, 1L, NA, 1L), .Label = "A", class = "factor")), class = "data.frame", row.names = c(NA, 
-15L))

Overview

id   time       apples   service
1    13:20:01   2        
2    13:20:02   2        A         
3    13:20:03   2        A         
4    13:20:04   3                  
5    13:20:05   3        A         
6    13:20:06   2                  
7    13:20:07   2        A         
8    13:20:08   2                 
9    13:20:09   2                  
10   13:20:10   2                 
11   13:20:11   2                 
12   13:20:12   2
14   13:20:13   2        A         
15   13:20:14   2                  
16   13:20:15   2        A         

This is the format I am looking for. ID 2 to ID 8 is one order, and ID 14 to ID 16 is another.

id   time       apples   service  Order
1    13:20:01   2        
2    13:20:02   2        A         1
3    13:20:03   2        A         1
4    13:20:04   3                  1
5    13:20:05   3        A         1
6    13:20:06   2                  1
7    13:20:07   2        A         1
8    13:20:08   2                 
9    13:20:09   2                  
10   13:20:10   2                 
11   13:20:11   2                 
12   13:20:12   2
14   13:20:13   2        A         2
15   13:20:14   2                  2
16   13:20:15   2        A         2

I tried it with a for loop. I suppose there is a way to do this with mutate and some kind of "range" condition.

Thanks for your help!

This is my output, produced by tspano's code:

# A tibble: 15 x 11
id time     apples service start   end g0       g1 g2       g3 order
<dbl> <fct>     <dbl> <fct>   <dbl> <dbl> <chr> <int> <chr> <int> <int>
1     1 13:20:01      2 NA          0     3 NA        0 NA        0    NA
2     2 13:20:02      2 A           1     3 start     1 NA        0    NA
3     3 13:20:03      2 A           2     3 NA        1 NA        0    NA
4     4 13:20:04      3 NA          2     2 NA        1 NA        0    NA
5     5 13:20:05      3 A           3     2 NA        1 NA        0    NA
6     6 13:20:06      2 NA          3     1 NA        1 NA        0    NA
7     7 13:20:07      2 A           3     1 NA        1 NA        0    NA
8     8 13:20:08      2 NA          2     0 end       2 NA        0    NA
9     9 13:20:09      2 NA          2     1 NA        2 NA        0    NA
10    10 13:20:10      2 NA          1     1 NA        2 NA        0    NA
11    11 13:20:11      2 NA          1     2 NA        2 NA        0    NA
12    12 13:20:12      2 NA          0     2 NA        2 NA        0    NA
13    14 13:20:13      2 A           1     2 start     3 NA        0    NA
14    15 13:20:14      2 NA          1     1 NA        3 NA        0    NA
15    16 13:20:15      2 A           2     1 NA        3 NA        0    NA

Here is a solution using RcppRoll, which should be faster than an R for loop:

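library(dplyr)  # provides %>%, mutate, case_when, lag, first, if_else

data %>%
  mutate(
    # number of "A" rows among the current row and the 4 rows before it
    start = RcppRoll::roll_sum(c(rep(FALSE, 4), (service == "A") %in% TRUE), n = 5, align = "right"),
    # number of "A" rows among the current row and the 4 rows after it
    end = RcppRoll::roll_sum(c((service == "A") %in% TRUE, rep(FALSE, 4)), n = 5, align = "left"),
    # mark where a run of activity begins and where no service follows
    g0 = case_when(start > 0 & (lag(start) == 0) %in% c(TRUE, NA) ~ "start",
                   end == 0 ~ "end",
                   TRUE ~ NA_character_)
  ) %>%
  # g1: block counter that increases at every "start"/"end" marker
  group_by(g1 = cumsum(!is.na(g0))) %>%
  # g2: blocks opened by "start" are orders; other blocks stay NA
  mutate(g2 = if_else(first(g0) == "end", NA_character_, "order")) %>%
  ungroup() %>%
  # g3: running count of order blocks
  group_by(g3 = cumsum(!is.na(g2) & is.na(lag(g2)))) %>%
  mutate(order = if_else(is.na(g2), NA_integer_, g3)) %>%
  ungroup() %>%
  select(id, time, apples, service, order)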

If you remove the final select, you can see several intermediate results, which should make the logic clear.
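For comparison, the same grouping rule can be sketched in base R without any rolling sums. This is only a minimal sketch, not the method from the answer above: `assign_orders` and its `window` argument are hypothetical names introduced here, and it assumes that an order spans from the first to the last service row of a chain in which consecutive service rows are at most `window` rows apart. It also assumes that a single isolated service row forms an order of its own, which the question does not specify.

```r
# Hypothetical helper (not part of the original answer): chain service rows
# whose distance to the previous service row is at most `window`.
assign_orders <- function(service, window = 5) {
  out <- rep(NA_integer_, length(service))
  hits <- which(!is.na(service))   # row positions that carry a service value
  if (length(hits) == 0) return(out)
  # a new chain starts whenever the gap to the previous service row exceeds `window`
  chain <- cumsum(c(TRUE, diff(hits) > window))
  for (k in unique(chain)) {
    rows <- hits[chain == k]
    # fill every row between the first and last service row of the chain
    out[min(rows):max(rows)] <- k
  }
  out
}

# the service column from the question's sample data
service <- factor(c(NA, "A", "A", NA, "A", NA, "A", NA, NA, NA, NA, NA, "A", NA, "A"))
result <- assign_orders(service)
print(result)
```

On a data frame this could be applied as `data$Order <- assign_orders(data$service)`. Working on the positions of the service rows keeps the rule (a gap of more than 5 rows starts a new order) visible in a single `diff` call.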
