R: How do I c() nested character vectors grouped by another column?



I have strings that contain comma-separated lists of words, grouped by word type. For simplicity, the example below has only one type.

library(tidyverse)

ka = tibble(
  words = c('apple, orange', 'pear, apple, plum'),
  type = 'fruit'
)

I want to find the number of unique words for each type.

I thought I would split the character vectors,

ka = ka %>%
  mutate(
    word_list = str_split(words, ', ')
  )

and then concatenate the resulting list elements within each group. The end result would be

c(
  ka$word_list[[1]],
  ka$word_list[[2]]
)

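Then I could call unique() on these vectors and take their lengths.

I don't know how to concatenate the list elements grouped by a separate column. I could do it with an ugly loop inside a loop, but there must be a map/apply solution along these lines: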

ka %>%
  group_by(type) %>%
  summarise(
    biglist = map(word_list, ~ c(.)), # this doesn't work, obviously
    biglist_unique = map(biglist, ~ unique(.)),
    biglist_length = map(biglist_unique, ~ length(.))
  )

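Here is one option. First we collapse the strings within each group, then we map over the result to get what you are looking for. Note that we have to trim whitespace to get the correct unique words.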

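library(tidyverse)
ka %>%
  group_by(type) %>%
  summarise(all_words = paste(words, collapse = ",")) %>%
  mutate(
    # split on commas, then trim the leftover whitespace around each word
    biglist = str_split(all_words, ",") %>% map(~ str_trim(.x, "both")),
    # unique(.x) returns the distinct words; .x[unique(.x)] would index by name and yield NAs
    biglist_unique = map(biglist, unique),
    biglist_length = map_dbl(biglist_unique, length)
  )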
#> # A tibble: 1 x 5
#>   type  all_words                       biglist   biglist_unique biglist_length
#>   <chr> <chr>                           <list>    <list>                  <dbl>
#> 1 fruit apple, orange,pear, apple, plum <chr [5]> <chr [4]>                   4
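
The idea from the question (concatenating the split vectors within each group) can also be written directly, without pasting the strings back together first. The following is only a minimal sketch, assuming the same `ka` data as in the question; the name `res` is made up for illustration. Within each group, `word_list` is a list of character vectors, so `unlist()` flattens it, and wrapping the result in `list()` keeps it as a single list-column entry per group.

```r
library(dplyr)
library(stringr)

# Same example data as in the question
ka <- tibble(
  words = c("apple, orange", "pear, apple, plum"),
  type = "fruit"
)

res <- ka %>%
  mutate(word_list = str_split(words, ", ")) %>%
  group_by(type) %>%
  summarise(
    # flatten all split vectors of the group into one character vector
    biglist = list(unlist(word_list)),
    # keep only the distinct words
    biglist_unique = list(unique(unlist(word_list))),
    # count the distinct words
    biglist_length = n_distinct(unlist(word_list))
  )

res
```

Here `biglist` and `biglist_unique` are list-columns with one character vector per type, and `biglist_length` is a plain integer column.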

Another option is to use tidy data principles and the tidyr package.

ka = ka %>%
  mutate(
    word_list = str_split(words, ', ')
  )
ka %>%
  # If you need to maintain information about each row you can create an index
  # mutate(index = row_number()) %>%
  # unnest the wordlist to get one word per row
  unnest(word_list) %>%
  # Only keep unique words per group
  group_by(type) %>%
  distinct(word_list, .keep_all = FALSE) %>% # if you need to maintain row info .keep_all = TRUE
  summarise(n_unique = n())
# A tibble: 1 x 2
#   type  n_unique
#   <chr>    <int>
# 1 fruit        4

Here is how you can do it with separate_rows:

ka %>%
  separate_rows(words, sep = ', ') %>%
  group_by(type) %>%
  summarise(word_c = n_distinct(words))
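
Since the example in the question has only one type, it may help to see the same pipeline on data with more than one group. This is a sketch under assumptions: the tibble `kb`, its vegetable rows, and the name `counts` are invented here for illustration and do not come from the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with two word types (not from the original question)
kb <- tibble(
  words = c("apple, orange", "pear, apple, plum", "carrot, leek", "leek"),
  type = c("fruit", "fruit", "vegetable", "vegetable")
)

counts <- kb %>%
  # one word per row; the separator includes the space, so no trimming is needed
  separate_rows(words, sep = ", ") %>%
  group_by(type) %>%
  # number of distinct words within each type
  summarise(word_c = n_distinct(words))

counts
```

With this input, `counts` should have one row per type, with 4 distinct words for fruit and 2 for vegetable.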

Like this:

library(tidyverse)
ka %>%
  mutate(words = strsplit(as.character(words), ",")) %>%
  unnest(words) %>%
  mutate(words = gsub(" ", "", words)) %>%
  group_by(type) %>%
  summarise(number = n_distinct(words),
            words = paste0(unique(words), collapse = ' '))
# A tibble: 1 x 3
#   type  number words
#   <chr>  <int> <chr>
# 1 fruit      4 apple orange pear plum
