r - Separate a column on the space after a keyword



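I have a data frame column containing a string that may include several spaces. I want to use separate from tidyr (or something similar) to split on the space after the first occurrence of a keyword (i.e., fruit_key in the sample data), so that the one column becomes two.

Sample data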

df <- structure(list(fruit = c("Apple Orange Pineapple", "Plum Good Watermelon", 
"Plum Good Kiwi", "Plum Good Plum Good", "Cantaloupe Melon", "Blueberry Blackberry Cobbler", 
"Peach Pie Apple Pie")), class = "data.frame", row.names = c(NA, 
-7L))
fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")

Expected output

                         fruit   Delicious                Tasty
1       Apple Orange Pineapple       Apple     Orange Pineapple
2         Plum Good Watermelon   Plum Good           Watermelon
3               Plum Good Kiwi   Plum Good                 Kiwi
4          Plum Good Plum Good   Plum Good            Plum Good
5             Cantaloupe Melon  Cantaloupe                Melon
6 Blueberry Blackberry Cobbler   Blueberry   Blackberry Cobbler
7          Peach Pie Apple Pie   Peach Pie            Apple Pie

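With separate, I can get the part after the keyword into the correct column (i.e., Tasty), but I cannot return the actual keyword for the other column (i.e., Delicious). I tried changing the regex several times but never got the right output.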

library(tidyr)
separate(df, fruit,
         c("Delicious", "Tasty"),
         sep = paste(fruit_key, collapse = "|"),
         extra = "merge",
         remove = FALSE
)
#                         fruit Delicious               Tasty
#1       Apple Orange Pineapple              Orange Pineapple
#2         Plum Good Watermelon                    Watermelon
#3               Plum Good Kiwi                          Kiwi
#4          Plum Good Plum Good                     Plum Good
#5             Cantaloupe Melon                         Melon
#6 Blueberry Blackberry Cobbler            Blackberry Cobbler
#7          Peach Pie Apple Pie                     Apple Pie
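
The empty Delicious column is consistent with how delimiter-based splitting generally behaves: the text matched by the separator is consumed rather than kept. A minimal base R sketch of the same effect, using strsplit as a stand-in for separate (an analogy chosen for illustration, not a description of how tidyr is implemented):

```r
# Split on the keyword itself, as the sep pattern above does.
x <- "Apple Orange Pineapple"
parts <- strsplit(x, "Apple", fixed = TRUE)[[1]]

# The keyword is used up as the delimiter: the first piece is empty
# and the second piece keeps its leading space.
print(parts)
```

To keep the keyword, the pattern has to match only the space after it, or the keyword has to be captured explicitly.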

I know I can use str_extract and str_remove (as shown below), but I would like to do this in one function/step with something like separate.

library(tidyverse)
df %>%
  mutate(Delicious = str_extract(fruit, paste(fruit_key, collapse = "|")),
         Tasty = str_remove(fruit, paste(fruit_key, collapse = "|")))

Here is a tidy solution that uses tidyr's extract function:

library(tidyr)
df %>%
  extract(fruit,
          into = c("Delicious", "Tasty"),
          # the backslash must be doubled inside an R string literal
          regex = paste0("(", paste0(fruit_key, collapse = "|"), ")\\s(.*)"),
          remove = FALSE)
                         fruit  Delicious              Tasty
1       Apple Orange Pineapple      Apple   Orange Pineapple
2         Plum Good Watermelon  Plum Good         Watermelon
3               Plum Good Kiwi  Plum Good               Kiwi
4          Plum Good Plum Good  Plum Good          Plum Good
5             Cantaloupe Melon Cantaloupe              Melon
6 Blueberry Blackberry Cobbler  Blueberry Blackberry Cobbler
7          Peach Pie Apple Pie  Peach Pie          Apple Pie

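In the regex argument of extract, we collapse fruit_key into an alternation pattern and wrap it in parentheses so that it is recognized as a capture group. The second capture group is simply everything after the space.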
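
To see what the two capture groups pick up, the same pattern can be tried outside tidyr. The sketch below uses base R's regexec and regmatches purely for illustration; the variable names x, pattern, and m are hypothetical, and it assumes the keywords contain no regex metacharacters that would need escaping.

```r
fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")
x <- c("Plum Good Plum Good", "Peach Pie Apple Pie")

# Group 1: any one of the keywords; group 2: everything after the space.
pattern <- paste0("(", paste0(fruit_key, collapse = "|"), ")\\s(.*)")

# Each list element holds the full match followed by the two groups.
m <- regmatches(x, regexec(pattern, x, perl = TRUE))
print(m)
```

Because the regex engine takes the leftmost match, "Peach Pie Apple Pie" is split after "Peach Pie" even though "Apple" is also a keyword.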

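If we need to use separate with sep, create a regex lookaround, "(?<=<fruit_key>) ", i.e. split at the space that follows a fruit_key word. Because sep is not vectorized, collapse the patterns into a single string joined by | (with str_c).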

library(dplyr)
library(tidyr)
library(stringr)
df %>% 
  separate(fruit, into = c("Delicious", "Tasty"), 
           sep = str_c(sprintf("(?<=%s) ", fruit_key), collapse = "|"), 
           extra = "merge", remove = FALSE)

-output

                         fruit  Delicious              Tasty
1       Apple Orange Pineapple      Apple   Orange Pineapple
2         Plum Good Watermelon  Plum Good         Watermelon
3               Plum Good Kiwi  Plum Good               Kiwi
4          Plum Good Plum Good  Plum Good          Plum Good
5             Cantaloupe Melon Cantaloupe              Melon
6 Blueberry Blackberry Cobbler  Blueberry Blackberry Cobbler
7          Peach Pie Apple Pie  Peach Pie          Apple Pie
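
The lookbehind idea can also be checked without tidyr. The following is a rough base R sketch, not part of the answer above: it finds the first space that is preceded by a keyword and cuts the string there. The names x, sep, pos, and len are hypothetical, and it assumes every string contains at least one keyword followed by a space.

```r
fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")
x <- c("Plum Good Plum Good", "Peach Pie Apple Pie")

# One fixed-length lookbehind per keyword, joined into a single pattern.
sep <- paste(sprintf("(?<=%s) ", fruit_key), collapse = "|")

# regexpr returns the position of the first match only, which plays
# the role of extra = "merge" in separate.
pos <- regexpr(sep, x, perl = TRUE)
len <- attr(pos, "match.length")
pos <- as.integer(pos)

Delicious <- substr(x, 1, pos - 1)        # text before the matched space
Tasty <- substr(x, pos + len, nchar(x))   # text after the matched space
print(data.frame(fruit = x, Delicious, Tasty))
```

The lookbehind matches only the space, so the keyword itself stays in the left-hand piece.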
