I have strings that contain enumerations of words, grouped by word type. For simplicity, the example below has only one type.
ka = tibble(
words = c('apple, orange', 'pear, apple, plum'),
type = 'fruit'
)
I want to find the number of unique words for each type.
My idea was to split the character vector,
ka = ka %>%
mutate(
word_list = str_split(words, ', ')
)
and then combine the elements of the list column within each group. The end result would be
c(
ka$word_list[[1]],
ka$word_list[[2]]
)
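I could then take the unique values of these vectors and get their lengths.
I don't know how to combine the list elements, grouped by a separate column. I could do it with an ugly loop inside a loop, but there must be a map/apply solution that follows this logic: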
ka %>%
group_by(type) %>%
summarise(
biglist = map(word_list, ~ c(.)), # this doesn't work, obviously
biglist_unique = map(biglist, ~ unique(.)),
biglist_length = map(biglist_unique, ~ length(.))
)
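Here is one option. First we collapse the strings within each group, then map over the result to get what you are looking for. Note that we have to trim whitespace to get the correct unique words.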
library(tidyverse)
ka %>%
group_by(type) %>%
summarise(all_words = paste(words, collapse = ",")) %>%
mutate(biglist = str_split(all_words, ",") %>% map(., ~str_trim(.x, "both")),
biglist_unique = map(biglist, unique), # unique() returns the distinct words directly
biglist_length = map_dbl(biglist_unique, length))
#> # A tibble: 1 x 5
#> type all_words biglist biglist_unique biglist_length
#> <chr> <chr> <list> <list> <dbl>
#> 1 fruit apple, orange,pear, apple, plum <chr [5]> <chr [4]> 4
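If only the count is needed, the same idea can probably be written more compactly by flattening the list column inside `summarise()`. This is a minimal sketch, not part of the answer above; it assumes the `ka` data from the question and uses a regex split (`",\\s*"`) so no separate trimming step is required.

```r
library(dplyr)
library(stringr)

# Example data from the question
ka <- tibble(
  words = c("apple, orange", "pear, apple, plum"),
  type = "fruit"
)

res <- ka %>%
  # split on a comma followed by optional whitespace
  mutate(word_list = str_split(words, ",\\s*")) %>%
  group_by(type) %>%
  # unlist() flattens all word vectors of a group into one character vector
  summarise(n_unique = n_distinct(unlist(word_list)))

res
```

`unlist()` does the "bind the list elements together" step that the question asks about, and `n_distinct()` combines the unique and length steps.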
Another option is to use tidy data principles and the tidyr package.
ka = ka %>%
mutate(
word_list = str_split(words, ', ')
)
ka %>%
# If you need to maintain information about each row you can create an index
# mutate(index = row_number()) %>%
# unnest the wordlist to get one word per row
unnest(word_list) %>%
# Only keep unique words per group
group_by(type) %>%
distinct(word_list, .keep_all = FALSE) %>% # if you need to maintain row info .keep_all = TRUE
summarise(n_unique = n())
# A tibble: 1 x 2
# type n_unique
# <chr> <int>
# 1 fruit 4
Here is how you can do it with separate_rows:
ka %>%
separate_rows(words, sep = ', ') %>%
group_by(type) %>%
summarise(word_c = n_distinct(words))
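To check that the grouping behaves as expected with more than one type, here is a small sketch on made-up data. The tibble `kb` and its `vegetable` rows are hypothetical and not from the question; the split pattern `",\\s*"` is also my assumption, chosen so that a missing space after a comma does not matter.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with two types (not from the question)
kb <- tibble(
  words = c("apple, orange", "pear, apple, plum", "carrot, leek", "leek"),
  type = c("fruit", "fruit", "vegetable", "vegetable")
)

res2 <- kb %>%
  # one row per word; sep is treated as a regular expression
  separate_rows(words, sep = ",\\s*") %>%
  group_by(type) %>%
  # count distinct words within each type
  summarise(word_c = n_distinct(words))

res2
```

Each type gets its own row in the result, so repeated words are only collapsed within a type, not across types.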
Like this:
library(tidyverse)
ka %>%
mutate(words = strsplit(as.character(words), ",")) %>%
unnest(words) %>%
mutate(words = gsub(" ","",words)) %>%
group_by(type) %>%
summarise(number = n_distinct(words),
words = paste0(unique(words), collapse =' '))
# A tibble: 1 x 3
type number words
<chr> <int> <chr>
1 fruit 4 apple orange pear plum
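For comparison, the same count can be sketched in base R without any packages. This is only an illustration under the assumption that the data looks like the question's `ka`; the helper name `count_unique_words` is made up for this example.

```r
# Example data from the question, as a plain data frame
ka <- data.frame(
  words = c("apple, orange", "pear, apple, plum"),
  type = "fruit",
  stringsAsFactors = FALSE
)

# Hypothetical helper: split on commas, trim whitespace, count distinct words
count_unique_words <- function(w) {
  parts <- trimws(unlist(strsplit(w, ",", fixed = TRUE)))
  length(unique(parts))
}

# split() groups the strings by type; sapply() applies the helper per group
counts <- sapply(split(ka$words, ka$type), count_unique_words)

counts
```

The result is a named vector with one entry per type, which is less convenient than a tibble but avoids the extra dependencies.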