How to get a value for mutate from a different row in R/tidyverse



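In the data frame (tibble) in the reprex below, there are two sets of rows that I want to group by. I can almost group by doc_num, but some rows have a different doc_num value yet belong to the same group. Specifically, PO-100 goes with the 15s and PO-101 goes with the 17s.

I know my attempt below fails because I can't refer to the doc_num of the current row inside the cur_data() call: it is interpreted as the doc_num column of cur_data() instead. But I don't know what to do about it.

It is guaranteed that each set of distinct doc_num values, i.e. each grp, has exactly one populated pur_ord value.

The expected output is the same example_df with a new column grp that I can group by later, whose values are five 15s followed by six 17s (i.e. c(rep(15, 5), rep(17, 6))).

How can I do this?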

library(tidyverse)
set.seed(123)
example_df <- tibble(
  doc_num = c(rep(15, 4), "PO-100", rep(17, 4), rep("PO-101", 2)),
  pur_ord = c("Purchase Order #PO-100", rep(NA_character_, 4),
              "Purchase Order #PO-101", rep(NA_character_, 5)),
  x = rnorm(11),
  y = sample(LETTERS, 11)
)
example_df
#> # A tibble: 11 × 4
#>    doc_num pur_ord                      x y    
#>    <chr>   <chr>                    <dbl> <chr>
#>  1 15      Purchase Order #PO-100 -0.560  Y    
#>  2 15      <NA>                   -0.230  I    
#>  3 15      <NA>                    1.56   C    
#>  4 15      <NA>                    0.0705 H    
#>  5 PO-100  <NA>                    0.129  G    
#>  6 17      Purchase Order #PO-101  1.72   J    
#>  7 17      <NA>                    0.461  Z    
#>  8 17      <NA>                   -1.27   S    
#>  9 17      <NA>                   -0.687  D    
#> 10 PO-101  <NA>                   -0.446  N    
#> 11 PO-101  <NA>                    1.22   A
example_df %>% 
  mutate(
    grp = case_when(
      !str_starts(doc_num, "PO") ~ doc_num,
      TRUE ~ cur_data() %>% 
        filter(str_detect(pur_ord, doc_num)) %>% 
        pull(doc_num)
    )
  )
#> Error in `mutate()`:
#> ! Problem while computing `grp = case_when(...)`.
#> Caused by error in `case_when()`:
#> ! `TRUE ~ cur_data() %>% filter(str_detect(pur_ord, doc_num)) %>%
#>   pull(doc_num)` must be length 11 or one, not 0.

Created on 2022-03-04 by the reprex package (v2.0.1)
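A small sketch of why the attempt returns zero rows, using shortened stand-in vectors (not the full example_df): str_detect() pairs each string with the pattern at the same position, so every pur_ord is only ever tested against the doc_num of its own row, and an NA string yields NA. filter() drops both FALSE and NA, which leaves nothing for pull() to return.

```r
library(stringr)

# Stand-in vectors: one row per case that occurs in example_df
pur_ord <- c("Purchase Order #PO-100", NA, "Purchase Order #PO-101")
doc_num <- c("15", "PO-100", "17")

# Element-wise: pur_ord[i] is tested against doc_num[i], never across rows
row_matches <- str_detect(pur_ord, doc_num)
row_matches
#> [1] FALSE    NA FALSE
```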

Here is a dplyr solution that should work even if the data is not strictly ordered.

# 1. Group by each value of "doc_num" and find the unique "pur_ord".
# 2. Copy the PO number for each row of each group.
# 3. Ensure that this value is always present.
# 4. Use the PO number as a lookup into the original "doc_num" column.

new_df <- example_df %>% 
  group_by(doc_num) %>% 
  mutate(
    po_col = max(pur_ord, na.rm = TRUE),
    # Backslashes must be doubled inside an R string literal
    po_col = gsub('.*(PO-\\d+)', '\\1', po_col),
    po_col = ifelse(!is.na(po_col), po_col, doc_num)
  ) %>% 
  group_by(po_col) %>% 
  mutate(
    grp = unique(doc_num[!grepl('PO', doc_num)])
  ) %>% 
  ungroup() %>% 
  select(-po_col)

doc_num pur_ord                      x y     grp  
<chr>   <chr>                    <dbl> <chr> <chr>
1 15      Purchase Order #PO-100 -0.560  Y     15   
2 15      NA                     -0.230  I     15   
3 15      NA                      1.56   C     15   
4 15      NA                      0.0705 H     15   
5 PO-100  NA                      0.129  G     15   
6 17      Purchase Order #PO-101  1.72   J     17   
7 17      NA                      0.461  Z     17   
8 17      NA                     -1.27   S     17   
9 17      NA                     -0.687  D     17   
10 PO-101  NA                     -0.446  N     17   
11 PO-101  NA                      1.22   A     17   
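The same lookup idea can also be sketched as an explicit join, which may be easier to read. This is an alternative under the question's stated guarantee (one populated pur_ord per group), not the answerer's code; the names po_lookup, grp_from_po and result are made up for illustration, and the x and y columns are left out for brevity.

```r
library(dplyr)
library(stringr)
library(tibble)

example_df <- tibble(
  doc_num = c(rep("15", 4), "PO-100", rep("17", 4), rep("PO-101", 2)),
  pur_ord = c("Purchase Order #PO-100", rep(NA_character_, 4),
              "Purchase Order #PO-101", rep(NA_character_, 5))
)

# Lookup table: each PO number mapped to the doc_num of the row that mentions it
po_lookup <- example_df %>%
  filter(!is.na(pur_ord)) %>%
  transmute(po = str_extract(pur_ord, "PO-\\d+"), grp_from_po = doc_num) %>%
  distinct()

# Rows whose doc_num is a PO number pick up the matching group;
# all other rows keep their own doc_num
result <- example_df %>%
  left_join(po_lookup, by = c("doc_num" = "po")) %>%
  mutate(grp = coalesce(grp_from_po, doc_num)) %>%
  select(-grp_from_po)

result$grp
#>  [1] "15" "15" "15" "15" "15" "17" "17" "17" "17" "17" "17"
```

Because the join matches on values rather than on position, this does not depend on row order. Note that grp is a character column here, since doc_num is character.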

Assuming your data is structured and ordered (i.e. the PO-XXX rows of each desired group always come last), you can do:

library(tidyverse)
example_df %>%
  mutate(grp = if_else(str_detect(doc_num, '^PO-'), NA_character_, doc_num)) %>%
  fill(grp)
# A tibble: 11 x 5
doc_num pur_ord                      x y     grp  
<chr>   <chr>                    <dbl> <chr> <chr>
1 15      Purchase Order #PO-100 -0.560  Y     15   
2 15      <NA>                   -0.230  I     15   
3 15      <NA>                    1.56   C     15   
4 15      <NA>                    0.0705 H     15   
5 PO-100  <NA>                    0.129  G     15   
6 17      Purchase Order #PO-101  1.72   J     17   
7 17      <NA>                    0.461  Z     17   
8 17      <NA>                   -1.27   S     17   
9 17      <NA>                   -0.687  D     17   
10 PO-101  <NA>                   -0.446  N     17   
11 PO-101  <NA>                    1.22   A     17 
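To make the ordering assumption concrete, here is a small sketch on a shortened stand-in tibble. The helper fill_grp and both data frames are hypothetical names for this illustration only. fill() fills downward by default, so a PO row that comes before its numeric rows has nothing above it to copy from.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)

# Hypothetical helper wrapping the mutate() + fill() approach
fill_grp <- function(df) {
  df %>%
    mutate(grp = if_else(str_detect(doc_num, "^PO-"), NA_character_, doc_num)) %>%
    fill(grp)
}

# PO rows come last within each group: works as intended
ordered_df <- tibble(doc_num = c("15", "15", "PO-100", "17", "PO-101"))
ordered_grp <- fill_grp(ordered_df)$grp
ordered_grp
#> [1] "15" "15" "15" "17" "17"

# PO-100 moved to the top: its grp stays NA
reordered_df <- ordered_df[c(3, 1, 2, 4, 5), ]
reordered_grp <- fill_grp(reordered_df)$grp
reordered_grp
#> [1] NA   "15" "15" "17" "17"
```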
