Using group_by with mutate, case_when, any() and all() in R



I have a status_df with an id and a status for each stage:

id  stage  status
15  1      Pending
15  2      Not Sent
16  1      Approved
16  2      Rejected
16  3      Not Sent
16  4      Not Sent
20  1      Approved
20  2      Approved
20  3      Approved

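Your attempt was on the right track, but you closed the any/all parentheses too early, before the comparison (==). Also, since you only need one row per id, you can use summarise instead of mutate, which also removes the need for select.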

library(dplyr)
status_df %>%
  group_by(id) %>%
  summarise(final_status = case_when(any(status == "Pending") ~ "Pending",
                                     any(status == "Rejected") ~ "Rejected",
                                     all(status == "Approved") ~ "Approved"))
#    id final_status
#* <int> <chr>       
#1    15 Pending     
#2    16 Rejected    
#3    20 Approved    

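We can use summarise instead of mutate, because mutate returns an output column with the same length as the input column; it is meant for creating or modifying columns, not for summarising.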
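
The length difference can be checked directly. The following is a minimal sketch, assuming dplyr 1.0.0 or later is installed; the column name any_rejected is made up for illustration and is not part of the original question.

```r
library(dplyr)

# Same ids and statuses as the status_df used in this thread.
status_df <- data.frame(
  id = c(15L, 15L, 16L, 16L, 16L, 16L, 20L, 20L, 20L),
  status = c("Pending", "Not Sent", "Approved", "Rejected", "Not Sent",
             "Not Sent", "Approved", "Approved", "Approved")
)

# mutate(): the single per-group value is recycled to every row of the group.
with_mutate <- status_df %>%
  group_by(id) %>%
  mutate(any_rejected = any(status == "Rejected")) %>%
  ungroup()

# summarise(): each group collapses to one row.
with_summarise <- status_df %>%
  group_by(id) %>%
  summarise(any_rejected = any(status == "Rejected"), .groups = "drop")

nrow(with_mutate)     # 9, one row per input row
nrow(with_summarise)  # 3, one row per id
```

With mutate you would still need distinct or select afterwards to get down to one row per id, which is why summarise is the more direct fit here.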

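Alternatively, a simpler option is to convert status to a factor with its levels given in a custom priority order, drop the unused levels (droplevels), and, after grouping by 'id', take the first of the remaining levels.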

library(dplyr)
status_df %>%
  group_by(id) %>%
  summarise(final_status = first(levels(droplevels(factor(status,
      levels = c("Pending", "Rejected", "Approved"))))), .groups = 'drop')

Output:

# A tibble: 3 x 2
#     id final_status
#  <int> <chr>       
#1    15 Pending     
#2    16 Rejected    
#3    20 Approved    
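
To see why the factor approach picks the highest-priority status, it may help to trace the steps for a single group in base R. This is only an illustrative sketch using the four statuses of id 16; the variable names priority, f and first_level are hypothetical.

```r
# Statuses for id 16, in stage order.
status <- c("Approved", "Rejected", "Not Sent", "Not Sent")

# Custom priority order: earlier levels win.
priority <- c("Pending", "Rejected", "Approved")

# Values that are not listed in `levels` ("Not Sent") become NA.
f <- factor(status, levels = priority)

levels(f)              # "Pending" "Rejected" "Approved"

# droplevels() removes levels that do not occur in this group ("Pending"),
# while keeping the remaining levels in priority order.
levels(droplevels(f))  # "Rejected" "Approved"

# The first remaining level is the group's final status.
first_level <- levels(droplevels(f))[1]
first_level            # "Rejected"
```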

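In the OP's code, any(status) returns NA; any() should instead be wrapped around a logical vector, i.e. any(status == "Pending"). Also, as mentioned above, this calls for summarise rather than mutate.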
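
The effect of the parenthesis placement can be reproduced in base R without dplyr. A small sketch, using the two statuses of id 15; the names wrong and right are only for illustration.

```r
status <- c("Pending", "Not Sent")

# Parenthesis closed too early: any() receives a character vector,
# coerces it to logical (all NA, with a warning) and returns NA.
# Comparing NA with "Pending" is still NA.
wrong <- suppressWarnings(any(status) == "Pending")

# Comparison inside any(): any() receives a logical vector.
right <- any(status == "Pending")

print(wrong)  # NA
print(right)  # TRUE
```

Inside case_when, an NA condition does not count as a match, so with the misplaced parenthesis no branch fires and the result is NA.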

Data
status_df <- structure(list(id = c(15L, 15L, 16L, 16L, 16L, 16L, 20L, 20L, 
20L), stage = c(1L, 2L, 1L, 2L, 3L, 4L, 1L, 2L, 3L), status = c("Pending", 
"Not Sent", "Approved", "Rejected", "Not Sent", "Not Sent", "Approved", 
"Approved", "Approved")), class = "data.frame", row.names = c(NA, 
-9L))
