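I have a data frame column containing strings that may include several spaces. I want to use separate() from tidyr (or something similar) to split on the space that follows the first occurrence of a keyword (i.e., an entry of fruit_key in the sample data), so that the one column is split into two.
Sample data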
df <- structure(list(fruit = c("Apple Orange Pineapple", "Plum Good Watermelon",
    "Plum Good Kiwi", "Plum Good Plum Good", "Cantaloupe Melon",
    "Blueberry Blackberry Cobbler", "Peach Pie Apple Pie")),
    class = "data.frame", row.names = c(NA, -7L))

fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")
Expected output
                         fruit  Delicious              Tasty
1       Apple Orange Pineapple      Apple   Orange Pineapple
2         Plum Good Watermelon  Plum Good         Watermelon
3               Plum Good Kiwi  Plum Good               Kiwi
4          Plum Good Plum Good  Plum Good          Plum Good
5             Cantaloupe Melon Cantaloupe              Melon
6 Blueberry Blackberry Cobbler  Blueberry Blackberry Cobbler
7          Peach Pie Apple Pie  Peach Pie          Apple Pie
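With separate(), I can get the part after the keyword into the correct column (i.e., Tasty), but I can't get the actual keyword returned in the other column (i.e., Delicious). I have tried changing the regular expression several times but never get the right output.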
library(tidyr)

separate(df, fruit,
         c("Delicious", "Tasty"),
         sep = paste(fruit_key, collapse = "|"),
         extra = "merge",
         remove = FALSE
)
#                         fruit Delicious               Tasty
#1       Apple Orange Pineapple              Orange Pineapple
#2         Plum Good Watermelon                    Watermelon
#3               Plum Good Kiwi                          Kiwi
#4          Plum Good Plum Good                     Plum Good
#5             Cantaloupe Melon                         Melon
#6 Blueberry Blackberry Cobbler            Blackberry Cobbler
#7          Peach Pie Apple Pie                     Apple Pie
I know I can use str_extract() and str_remove() (as shown below), but I would like to do it in a single function/step with something like separate().
library(tidyverse)

df %>%
  mutate(Delicious = str_extract(fruit, paste(fruit_key, collapse = "|")),
         Tasty = str_remove(fruit, paste(fruit_key, collapse = "|")))
Here is a tidy solution using the extract() function from tidyr:
library(tidyr)

df %>%
  extract(fruit,
          into = c("Delicious", "Tasty"),
          # the backslash must be doubled inside an R string literal
          regex = paste0("(", paste0(fruit_key, collapse = "|"), ")\\s(.*)"),
          remove = FALSE)
                         fruit  Delicious              Tasty
1       Apple Orange Pineapple      Apple   Orange Pineapple
2         Plum Good Watermelon  Plum Good         Watermelon
3               Plum Good Kiwi  Plum Good               Kiwi
4          Plum Good Plum Good  Plum Good          Plum Good
5             Cantaloupe Melon Cantaloupe              Melon
6 Blueberry Blackberry Cobbler  Blueberry Blackberry Cobbler
7          Peach Pie Apple Pie  Peach Pie          Apple Pie
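In the regex argument of extract(), we collapse fruit_key into an alternation pattern and wrap it in parentheses so that it is recognized as a capture group. The second capture group is simply whatever follows the whitespace character after the keyword.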
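To make the two capture groups concrete, here is a minimal sketch that builds the same pattern and applies it with base R's regexec() and regmatches() rather than tidyr. It is only meant to show what each group captures. The names pattern, x, m and parts are illustrative and not part of the answer above.

```r
fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")

# Group 1: any one of the keys; then one whitespace character; group 2: the rest
pattern <- paste0("(", paste0(fruit_key, collapse = "|"), ")\\s(.*)")
cat(pattern, "\n")

# Two of the trickier rows from the sample data
x <- c("Plum Good Plum Good", "Peach Pie Apple Pie")

# regexec() returns the full match followed by each capture group
m <- regmatches(x, regexec(pattern, x))

# Drop the full match (column 1) and keep the two groups
parts <- do.call(rbind, m)[, 2:3, drop = FALSE]
colnames(parts) <- c("Delicious", "Tasty")
print(parts)
```

None of the sample keys contains a regex metacharacter. If real keys did (for example "." or "("), they would need escaping before being pasted into the pattern.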
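If we need to use separate() with sep, create a regex lookbehind, "(?<=<fruit_key>) ", i.e. split at the space that follows a fruit_key word. Because sep is not vectorized, collapse the per-key patterns into a single string joined by | (using str_c):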
library(dplyr)
library(tidyr)
library(stringr)

df %>%
  separate(fruit, into = c("Delicious", "Tasty"),
           sep = str_c(sprintf("(?<=%s) ", fruit_key), collapse = "|"),
           extra = "merge", remove = FALSE)
-output
                         fruit  Delicious              Tasty
1       Apple Orange Pineapple      Apple   Orange Pineapple
2         Plum Good Watermelon  Plum Good         Watermelon
3               Plum Good Kiwi  Plum Good               Kiwi
4          Plum Good Plum Good  Plum Good          Plum Good
5             Cantaloupe Melon Cantaloupe              Melon
6 Blueberry Blackberry Cobbler  Blueberry Blackberry Cobbler
7          Peach Pie Apple Pie  Peach Pie          Apple Pie
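The lookbehind pattern can match more than once in a string, which is why extra = "merge" matters: only the first split is kept and the remainder stays in Tasty. The following is a rough base R sketch of that behaviour, using regexpr() with perl = TRUE in place of separate(). The variable names sep_pattern, x, n_hits, pos, left and right are illustrative assumptions.

```r
fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")

# One lookbehind per key, joined with "|": match a space preceded by a key
sep_pattern <- paste(sprintf("(?<=%s) ", fruit_key), collapse = "|")

x <- c("Plum Good Plum Good", "Peach Pie Apple Pie")

# Count every match; the second string has two candidate split points,
# because "Apple" is also a key
n_hits <- lengths(gregexpr(sep_pattern, x, perl = TRUE))
print(n_hits)

# Split only at the first match, mimicking extra = "merge"
pos   <- regexpr(sep_pattern, x, perl = TRUE)
start <- as.integer(pos)
left  <- substr(x, 1, start - 1)
right <- substr(x, start + attr(pos, "match.length"), nchar(x))
print(data.frame(Delicious = left, Tasty = right))
```

Because a lookbehind is zero-width, the keyword itself is not consumed by the separator, so it survives in the left-hand column.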