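In the data frame (tibble) in the reprex below, there are two sets of rows that I would like to group by. I can almost group by doc_num, but some rows have a different doc_num value yet belong to the same group. Specifically, PO-100 goes with the 15s and PO-101 goes with the 17s.
I know my attempt below fails because, inside the cur_data() call, I cannot refer to the doc_num of the current row: it is interpreted as the doc_num column of cur_data(). But I don't know what else to do.
It is guaranteed that each distinct group of doc_num values, i.e. each grp, has exactly one pur_ord value populated.
The expected output is the same example_df with a new column, grp, that I can group by later, whose values are five 15s followed by six 17s (i.e. c(rep(15, 5), rep(17, 6))).
How can I do this?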
library(tidyverse)
set.seed(123)
example_df <- tibble(
  doc_num = c(rep(15, 4), "PO-100", rep(17, 4), rep("PO-101", 2)),
  pur_ord = c("Purchase Order #PO-100", rep(NA_character_, 4),
              "Purchase Order #PO-101", rep(NA_character_, 5)),
  x = rnorm(11),
  y = sample(LETTERS, 11)
)
example_df
#> # A tibble: 11 × 4
#> doc_num pur_ord x y
#> <chr> <chr> <dbl> <chr>
#> 1 15 Purchase Order #PO-100 -0.560 Y
#> 2 15 <NA> -0.230 I
#> 3 15 <NA> 1.56 C
#> 4 15 <NA> 0.0705 H
#> 5 PO-100 <NA> 0.129 G
#> 6 17 Purchase Order #PO-101 1.72 J
#> 7 17 <NA> 0.461 Z
#> 8 17 <NA> -1.27 S
#> 9 17 <NA> -0.687 D
#> 10 PO-101 <NA> -0.446 N
#> 11 PO-101 <NA> 1.22 A
example_df %>%
  mutate(
    grp = case_when(
      !str_starts(doc_num, "PO") ~ doc_num,
      TRUE ~ cur_data() %>%
        filter(str_detect(pur_ord, doc_num)) %>%
        pull(doc_num)
    )
  )
#> Error in `mutate()`:
#> ! Problem while computing `grp = case_when(...)`.
#> Caused by error in `case_when()`:
#> ! `TRUE ~ cur_data() %>% filter(str_detect(pur_ord, doc_num)) %>%
#> pull(doc_num)` must be length 11 or one, not 0.
Created on 2022-03-04 by the reprex package (v2.0.1)
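To make the failure mode concrete, here is a small sketch (not part of the original reprex) of what the filter() condition evaluates to. It rebuilds only the two relevant columns as plain vectors; the name pairwise is introduced here purely for illustration.

```r
library(stringr)

# The two columns from example_df, rebuilt as plain character vectors.
doc_num <- c(rep("15", 4), "PO-100", rep("17", 4), rep("PO-101", 2))
pur_ord <- c("Purchase Order #PO-100", rep(NA_character_, 4),
             "Purchase Order #PO-101", rep(NA_character_, 5))

# str_detect() is vectorised over both arguments: element i of pur_ord is
# tested against element i of doc_num used as the pattern.
pairwise <- str_detect(pur_ord, doc_num)
pairwise

# Row 1 tests "Purchase Order #PO-100" against "15" and row 6 tests
# "Purchase Order #PO-101" against "17"; every other row has an NA string.
# No element is TRUE, so filter() keeps zero rows and pull() returns a
# length-0 vector, which case_when() rejects.
sum(pairwise, na.rm = TRUE)
```

In other words, the condition never compares the PO number of one row with the pur_ord of another row, which is what the grouping needs.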
Here is a dplyr solution that should work even if the data are not strictly ordered.
# 1. Group by each value of "doc_num" and find the unique "pur_ord".
# 2. Copy the PO number for each row of each group.
# 3. Ensure that this value is always present.
# 4. Use the PO number as a lookup into the original "doc_num" column.
new_df <- example_df %>%
  group_by(doc_num) %>%
  mutate(
    po_col = max(pur_ord, na.rm = TRUE),
    # Keep only the "PO-<digits>" part; backslashes must be doubled in R strings.
    po_col = gsub('.*(PO-\\d+)', '\\1', po_col),
    po_col = ifelse(!is.na(po_col), po_col, doc_num)
  ) %>%
  group_by(po_col) %>%
  mutate(
    grp = unique(doc_num[!grepl('PO', doc_num)])
  ) %>%
  ungroup() %>%
  select(-po_col)
doc_num pur_ord x y grp
<chr> <chr> <dbl> <chr> <chr>
1 15 Purchase Order #PO-100 -0.560 Y 15
2 15 NA -0.230 I 15
3 15 NA 1.56 C 15
4 15 NA 0.0705 H 15
5 PO-100 NA 0.129 G 15
6 17 Purchase Order #PO-101 1.72 J 17
7 17 NA 0.461 Z 17
8 17 NA -1.27 S 17
9 17 NA -0.687 D 17
10 PO-101 NA -0.446 N 17
11 PO-101 NA 1.22 A 17
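The same lookup idea can also be sketched with a named vector instead of two grouping passes. This is an alternative sketch, not the answer's own code: po_rows, po_lookup and result are names introduced here for illustration, doc_num is written as character up front, and the x and y columns are omitted because they play no part in the grouping.

```r
library(dplyr)
library(stringr)

example_df <- tibble(
  doc_num = c(rep("15", 4), "PO-100", rep("17", 4), rep("PO-101", 2)),
  pur_ord = c("Purchase Order #PO-100", rep(NA_character_, 4),
              "Purchase Order #PO-101", rep(NA_character_, 5))
)

# Rows that carry a purchase order tell us which doc_num each PO belongs to.
po_rows <- filter(example_df, !is.na(pur_ord))

# Named vector: names are PO numbers ("PO-100"), values are doc_num ("15").
po_lookup <- setNames(po_rows$doc_num, str_extract(po_rows$pur_ord, "PO-\\d+"))

# PO rows take the looked-up doc_num; all other rows keep their own doc_num.
result <- example_df %>%
  mutate(grp = if_else(str_starts(doc_num, "PO"),
                       unname(po_lookup[doc_num]),
                       doc_num))

result$grp
```

Because the lookup is keyed on the PO number rather than on row position, this sketch should not depend on the row order either. It does assume that every PO-XXX value in doc_num appears in some pur_ord entry; a PO with no match would get NA in grp.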
Assuming your data are structured and ordered (i.e., the PO-XXX rows of each desired group always come last), you could do:
library(tidyverse)
example_df %>%
  mutate(grp = if_else(str_detect(doc_num, '^PO-'), NA_character_, doc_num)) %>%
  fill(grp)
# A tibble: 11 x 5
doc_num pur_ord x y grp
<chr> <chr> <dbl> <chr> <chr>
1 15 Purchase Order #PO-100 -0.560 Y 15
2 15 <NA> -0.230 I 15
3 15 <NA> 1.56 C 15
4 15 <NA> 0.0705 H 15
5 PO-100 <NA> 0.129 G 15
6 17 Purchase Order #PO-101 1.72 J 17
7 17 <NA> 0.461 Z 17
8 17 <NA> -1.27 S 17
9 17 <NA> -0.687 D 17
10 PO-101 <NA> -0.446 N 17
11 PO-101 <NA> 1.22 A 17
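The ordering assumption matters. As a small illustration on made-up data (reordered_df and out are hypothetical names, not part of the question), here is what the same pipeline gives when the PO-XXX rows come first in each group instead of last:

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical data where each PO row precedes the rows it belongs to.
reordered_df <- tibble(doc_num = c("PO-100", rep("15", 2), "PO-101", rep("17", 2)))

out <- reordered_df %>%
  mutate(grp = if_else(str_detect(doc_num, "^PO-"), NA_character_, doc_num)) %>%
  fill(grp)  # fills downwards by default

# The leading PO-100 row has nothing above it and stays NA, and PO-101
# inherits "15" from the group above it rather than "17".
out$grp
```

If the PO rows can appear anywhere within their group, an approach that matches on the PO number itself is the safer choice.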