R: Separate a column on the space after a keyword

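I have a data frame column containing strings that may include several spaces. I would like to use separate from tidyr (or something similar) to split on the space that follows the first occurrence of a keyword (i.e., one of the values in fruit_key in the sample data), so that the single column becomes two columns.

Sample data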

df <- structure(list(fruit = c("Apple Orange Pineapple", "Plum Good Watermelon", 
"Plum Good Kiwi", "Plum Good Plum Good", "Cantaloupe Melon", "Blueberry Blackberry Cobbler", 
"Peach Pie Apple Pie")), class = "data.frame", row.names = c(NA, 
-7L))
fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")

Expected output

                         fruit   Delicious                Tasty
1       Apple Orange Pineapple       Apple     Orange Pineapple
2         Plum Good Watermelon   Plum Good           Watermelon
3               Plum Good Kiwi   Plum Good                 Kiwi
4          Plum Good Plum Good   Plum Good            Plum Good
5             Cantaloupe Melon  Cantaloupe                Melon
6 Blueberry Blackberry Cobbler   Blueberry   Blackberry Cobbler
7          Peach Pie Apple Pie   Peach Pie            Apple Pie

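With separate I can get the part after the keyword into the correct column (Tasty), but I cannot get the keyword itself returned in the other column (Delicious). I have tried changing the regex several times but never get the right output.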

library(tidyr)
separate(df, fruit,
         c("Delicious", "Tasty"),
         sep = paste(fruit_key, collapse = "|"),
         extra = "merge",
         remove = FALSE
)
#                         fruit Delicious               Tasty
#1       Apple Orange Pineapple              Orange Pineapple
#2         Plum Good Watermelon                    Watermelon
#3               Plum Good Kiwi                          Kiwi
#4          Plum Good Plum Good                     Plum Good
#5             Cantaloupe Melon                         Melon
#6 Blueberry Blackberry Cobbler            Blackberry Cobbler
#7          Peach Pie Apple Pie                     Apple Pie
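
A likely reason the Delicious column comes back empty is that the text matched by sep is treated as a delimiter and discarded. The minimal sketch below uses base R's strsplit as a stand-in for illustration (it is not a claim about what tidyr calls internally) to show the same effect on one value:

```r
# Keywords from the sample data, collapsed into one alternation pattern
fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")
pattern <- paste(fruit_key, collapse = "|")

# The matched keyword acts as the delimiter, so it is removed from the result
parts <- strsplit("Apple Orange Pineapple", pattern)[[1]]
print(parts)
```

The first piece is an empty string because nothing precedes the keyword, and the keyword itself is gone.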

I know I can use str_extract and str_remove (as shown below), but I would like to do this in a single function/step with something like separate.

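library(tidyverse)
# Build the alternation pattern once: "Apple|Plum Good|..."
pattern <- paste(fruit_key, collapse = "|")
df %>%
  mutate(Delicious = str_extract(fruit, pattern),
         # str_remove() drops only the first match; trim the leftover leading space
         Tasty = str_trim(str_remove(fruit, pattern)))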

Here is a tidy solution using tidyr's extract function:

library(tidyr)
df %>%
  extract(fruit,
          into = c("Delicious", "Tasty"),
          # "\\s" needs a doubled backslash because this is an R string literal
          regex = paste0("(", paste0(fruit_key, collapse = "|"), ")\\s(.*)"),
          remove = FALSE)
                         fruit  Delicious              Tasty
1       Apple Orange Pineapple      Apple   Orange Pineapple
2         Plum Good Watermelon  Plum Good         Watermelon
3               Plum Good Kiwi  Plum Good               Kiwi
4          Plum Good Plum Good  Plum Good          Plum Good
5             Cantaloupe Melon Cantaloupe              Melon
6 Blueberry Blackberry Cobbler  Blueberry Blackberry Cobbler
7          Peach Pie Apple Pie  Peach Pie          Apple Pie

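In the regex argument of extract, we collapse fruit_key into an alternation pattern and wrap it in parentheses so that it is recognized as a capture group. The second capture group is simply everything that follows the space.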
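
To make the two capture groups concrete, here is a small sketch that builds the same pattern and applies it to a single value with base R's regexec and regmatches. Using these base functions with perl = TRUE is a choice made here for illustration; extract itself is not being reimplemented.

```r
fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")

# Group 1: any keyword; then one whitespace character; group 2: the rest
pattern <- paste0("(", paste0(fruit_key, collapse = "|"), ")\\s(.*)")
cat(pattern, "\n")

x <- "Plum Good Plum Good"
# Element 1 is the full match, elements 2 and 3 are the capture groups
m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]
print(m)
```

Only the first keyword is captured in group 1, and the repeated "Plum Good" ends up in group 2, which matches row 4 of the output above.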

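If we need to use separate with sep, we can build a regex lookaround of the form "(?<=<fruit_key>) ", i.e., split on the space that immediately follows a fruit_key word. Because sep takes a single pattern and is not vectorized, we collapse the per-keyword patterns into one string joined by | (using str_c).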

library(dplyr)
library(tidyr)
library(stringr)
df %>% 
  separate(fruit, into = c("Delicious", "Tasty"), 
           sep = str_c(sprintf("(?<=%s) ", fruit_key), collapse = "|"), 
           extra = "merge", remove = FALSE)

Output

                         fruit  Delicious              Tasty
1       Apple Orange Pineapple      Apple   Orange Pineapple
2         Plum Good Watermelon  Plum Good         Watermelon
3               Plum Good Kiwi  Plum Good               Kiwi
4          Plum Good Plum Good  Plum Good          Plum Good
5             Cantaloupe Melon Cantaloupe              Melon
6 Blueberry Blackberry Cobbler  Blueberry Blackberry Cobbler
7          Peach Pie Apple Pie  Peach Pie          Apple Pie
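
As a rough sketch of what the lookbehind pattern does, the snippet below builds the same sep string with base R's paste and locates the first qualifying space with regexpr. Splitting by hand with substr is an assumption made for illustration, intended to mimic the at-most-two-pieces behaviour of extra = "merge"; it is not how separate is implemented.

```r
fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")

# One lookbehind per keyword: match a space only when a keyword precedes it
sep <- paste(sprintf("(?<=%s) ", fruit_key), collapse = "|")

x <- "Peach Pie Apple Pie"
# Position of the first space that directly follows a keyword
pos <- as.integer(regexpr(sep, x, perl = TRUE))

left  <- substr(x, 1, pos - 1)         # the keyword, kept intact
right <- substr(x, pos + 1, nchar(x))  # everything after that space
print(c(left, right))
```

Because the lookbehind only asserts that a keyword precedes the space, the keyword is not consumed, so it survives on the left side of the split.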
