r - separate (or similar function) with multiple or no occurrences of splitting character

I have a tibble like this:

library("tidyverse")
tib <- tibble(x = c("lemon", "yellow, banana", "red, big, apple"))

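I want to create two new columns named description and fruit, using separate to extract the last word after the final comma (if there is a comma; otherwise, I just want to copy the word in the cell).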

So far, I have:

tib %>%
  separate(x, ", ", into = c("description", "fruit"), remove = FALSE)

But this does not quite do what I want, producing:

# A tibble: 3 x 3
x               description fruit 
<chr>           <chr>       <chr> 
1 lemon           lemon       NA    
2 yellow, banana  yellow      banana
3 red, big, apple red         big   
Warning messages:
1: Expected 2 pieces. Additional pieces discarded in 1 rows [3]. 
2: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [1].

My desired output is:

x               description fruit 
1 lemon           NA          lemon    
2 yellow, banana  yellow      banana
3 red, big, apple red, big    apple

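Can anyone point out what I am missing?

Edit

The goal does not have to be achieved with separate. mutate would work too, and such solutions are equally appreciated!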

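It may be better to use extract. Here, we can use capture groups to capture the characters as groups. It is easier to start from the end of the string ($) and work backwards: the word (\w+) at the end of the string goes into the second capture group, it is preceded by an optional , and an optional whitespace character (\s), and all the other characters go into the first capture group ((.*?)). Inside an R string the backslashes are doubled, e.g. \\w+.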

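library(tidyr)
library(dplyr)
tib %>%
  # backslashes must be doubled inside an R string
  extract(x, into = c("description", "fruit"), remove = FALSE,
          regex = '(.*?),?\\s?(\\w+$)') %>%
  # turn an empty description into NA, as in the desired output
  mutate(description = na_if(description, ""))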
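To see what the two capture groups pick up, the same pattern can be tried directly on the character vector. This is only a sketch: it uses stringr::str_match, which the answer above does not use, and the names x and m are purely illustrative.

```r
library(stringr)

x <- c("lemon", "yellow, banana", "red, big, apple")

# Column 1 is the full match, column 2 is group 1, column 3 is group 2.
m <- str_match(x, "(.*?),?\\s?(\\w+$)")

description <- m[, 2]  # "" "yellow" "red, big"
fruit       <- m[, 3]  # "lemon" "banana" "apple"
```

The first group is empty for "lemon" because nothing precedes the last word, which is why an extra step is needed if NA is wanted there.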

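Or use a regex lookaround with separate, specifying the separator as either a , followed by a space or the start of the string (^), in both cases followed by a word (\w+) at the end of the string ($).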

tib %>%
  # (?=\\w+$) is a lookahead: the last word is required but not consumed
  separate(x, into = c("description", "fruit"),
           remove = FALSE, sep = '(, |^)(?=\\w+$)') %>%
  mutate(description = na_if(description, ""))
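As a rough illustration of where that separator matches, the pattern can be located with base R's regexpr. This is a sketch for inspection only, not what separate does internally; the names x, m, pos and len are illustrative.

```r
x <- c("lemon", "yellow, banana", "red, big, apple")

# perl = TRUE is needed for the lookahead (?=...) in base R.
m <- regexpr("(, |^)(?=\\w+$)", x, perl = TRUE)

pos <- as.integer(m)                          # where the separator starts
len <- as.integer(attr(m, "match.length"))    # how many characters it consumes
```

Here pos is 1, 7, 9 and len is 0, 2, 2: for "lemon" the separator is the zero-width start of the string, so the piece before it is empty, while in the other rows only the last ", " is matched.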

Also, another option with separate is to insert a new delimiter before the last word and then use that as the sep:

library(stringr)
tib %>%
  # replace the last ", " (if any) with ";", keeping the last word via \\1
  mutate(x1 = str_replace(x, ',? ?(\\w+)$', ";\\1")) %>%
  separate(x1, into = c("description", "fruit"), sep = ";") %>%
  mutate(description = na_if(description, ""))
# A tibble: 3 x 3
#  x               description fruit 
#  <chr>           <chr>       <chr> 
#1 lemon           <NA>        lemon 
#2 yellow, banana  yellow      banana
#3 red, big, apple red, big    apple
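The intermediate column makes the idea easier to follow. A minimal sketch of just the replacement step, assuming that ";" never occurs in the data (otherwise a different placeholder delimiter would be needed):

```r
library(stringr)

x <- c("lemon", "yellow, banana", "red, big, apple")

# The last word is captured as group 1 and re-emitted after a ";".
x1 <- str_replace(x, ",? ?(\\w+)$", ";\\1")
x1
# ";lemon" "yellow;banana" "red, big;apple"
```

Every row now contains exactly one ";", so separate always finds two pieces and no warnings are raised.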

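Regex-based solutions, like the other two here, are probably better. However, if for some reason you would rather work with a list of words, here is another option.

Split the text into a list of strings. The description is everything except the item at position length(words). The fruit is the last item. If an empty string is acceptable instead of NA, you can drop the na_if bit.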

library(dplyr)
tib <- tibble(x = c("lemon", "yellow, banana", "red, big, apple"))
tib %>%
  mutate(words = strsplit(x, ", "),
         description = purrr::map_chr(words, ~paste(.[-length(.)], collapse = ", ")) %>% na_if(""),
         fruit = purrr::map_chr(words, last))
#> # A tibble: 3 x 4
#>   x               words     description fruit 
#>   <chr>           <list>    <chr>       <chr> 
#> 1 lemon           <chr [1]> <NA>        lemon 
#> 2 yellow, banana  <chr [2]> yellow      banana
#> 3 red, big, apple <chr [3]> red, big    apple

Obviously, you can drop the words column - I only kept it to show its type.
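The same word-list idea can also be written without purrr. The helper below is a hypothetical base R sketch (split_last is not an existing function), and it assumes every input string is non-empty:

```r
# Hypothetical helper: split each string on `sep`, keep the last piece as
# `fruit` and glue the remaining pieces back together as `description`.
split_last <- function(x, sep = ", ") {
  words <- strsplit(x, sep, fixed = TRUE)
  fruit <- vapply(words, function(w) w[length(w)], character(1))
  description <- vapply(
    words,
    function(w) paste(w[-length(w)], collapse = sep),
    character(1)
  )
  # No pieces before the last word means no description.
  description[description == ""] <- NA_character_
  data.frame(description, fruit, stringsAsFactors = FALSE)
}

res <- split_last(c("lemon", "yellow, banana", "red, big, apple"))
```

The result can be attached to the original tibble with cbind or dplyr::bind_cols.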

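You can use a regex to get the description: strip the last comma and everything after it. ",[^,]+$" matches a comma followed by anything that is not a comma, up to the end of the string.

To get the fruit, use the word function from the stringr package to take the last word.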

tib %>%
  mutate(desc = if_else(grepl(",", x), sub(",[^,]+$", "", x), NA_character_),
         fruit = stringr::word(x, -1))
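Looking at the pieces separately shows why the grepl check is there. A small sketch on the plain vector; the names stripped and has_comma are illustrative:

```r
x <- c("lemon", "yellow, banana", "red, big, apple")

# sub() leaves strings without a comma untouched, so "lemon" survives as is.
stripped <- sub(",[^,]+$", "", x)    # "lemon" "yellow" "red, big"

# Only keep the stripped text where there was a comma to strip.
has_comma <- grepl(",", x)
desc <- ifelse(has_comma, stripped, NA_character_)

# word() splits on spaces; -1 counts from the end.
fruit <- stringr::word(x, -1)        # "lemon" "banana" "apple"
```

Without the check, "lemon" would end up in both columns.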
