How to use R to extract the occurrences of one string vector within another string vector



I have a vector of strings like this:

strings <- tibble(string = c("apple, orange, plum, tomato",
"plum, beat, pear, cactus",
"centipede, toothpick, pear, fruit"))

I have a vector of fruits:

fruits <- tibble(fruit =c("apple", "orange", "plum", "pear"))

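What I want is a data.frame/tibble with the original strings data plus a second list or character column holding all the fruits contained in that original column. Something like this: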

strings <- tibble(string = c("apple, orange, plum, tomato",
"plum, beat, pear, cactus",
"centipede, toothpick, pear, fruit"),
match = c("apple, orange, plum",
"plum, pear",
"pear")
)

I tried str_extract(strings, fruits) and got a list in which everything is blank, along with this warning:

Warning message:
In stri_detect_regex(string, pattern, opts_regex = opts(pattern)):
longer object length is not a multiple of shorter object length

I also tried str_extract_all(strings, paste0(fruits, collapse = "|")), but I get the same warning message.

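I have looked at the question about finding matches of a vector of strings in another vector of strings, but it does not seem to help here.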

Any help would be appreciated.

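Here is one option. First, we split each row of the string column into separate strings (right now "apple, orange, plum, tomato" is all one string). Then we compare that list of strings with the contents of the fruits$fruit column and store the matching values as a list in a new fruits column.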

library("tidyverse")
strings <- tibble(
  string = c(
    "apple, orange, plum, tomato",
    "plum, beat, pear, cactus",
    "centipede, toothpick, pear, fruit"
  )
)
fruits <- tibble(fruit = c("apple", "orange", "plum", "pear"))
strings %>%
  mutate(str2 = str_split(string, ", ")) %>%
  rowwise() %>%
  mutate(fruits = list(intersect(str2, fruits$fruit)))
#> Source: local data frame [3 x 3]
#> Groups: <by row>
#> 
#> # A tibble: 3 x 3
#>   string                            str2      fruits   
#>   <chr>                             <list>    <list>   
#> 1 apple, orange, plum, tomato       <chr [4]> <chr [3]>
#> 2 plum, beat, pear, cactus          <chr [4]> <chr [2]>
#> 3 centipede, toothpick, pear, fruit <chr [4]> <chr [1]>

Created on 2018-08-07 by the reprex package (v0.2.0).
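The option above leaves the matches in a list column, whereas the question shows a comma-separated character column. A minimal sketch of one way to collapse each row's matches, assuming purrr's map_chr and base paste; the helper column name parts and the result name with_match are illustrative and not from the original answer:

```r
library(dplyr)
library(purrr)
library(stringr)
library(tibble)

strings <- tibble(string = c("apple, orange, plum, tomato",
                             "plum, beat, pear, cactus",
                             "centipede, toothpick, pear, fruit"))
fruits <- tibble(fruit = c("apple", "orange", "plum", "pear"))

with_match <- strings %>%
  mutate(
    # split each row into its individual words (hypothetical helper column)
    parts = str_split(string, ", "),
    # keep only the words that are fruits, then glue them back together;
    # paste() returns "" for rows with no match, so map_chr always gets one string
    match = map_chr(parts, ~ paste(intersect(.x, fruits$fruit), collapse = ", "))
  ) %>%
  select(-parts)

with_match
```

Because map_chr returns a plain character vector, no rowwise() or unnest() step is needed here.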

Here is an example using purrr:

library("tidyverse")
strings <- tibble(string = c("apple, orange, plum, tomato",
                             "plum, beat, pear, cactus",
                             "centipede, toothpick, pear, fruit"))
fruits <- tibble(fruit = c("apple", "orange", "plum", "pear"))
# Try every pattern against one string and keep only the patterns that matched
extract_if_exists <- function(string_to_parse, pattern) {
  extraction <- stringi::stri_extract_all_regex(string_to_parse, pattern)
  extraction <- unlist(extraction[!(is.na(extraction))])
  return(extraction)
}
strings %>%
  mutate(matches = map(string, extract_if_exists, fruits$fruit)) %>%
  # collapse the matches (not the original string) into one value per row
  mutate(matches = map(matches, str_c, collapse = ", ")) %>%
  unnest(matches)
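A variant of the regex approach, closer to what the question attempted: str_extract_all works on character vectors, so it needs the columns (strings$string, fruits$fruit) rather than the tibbles themselves. Collapsing the fruit names into a single alternation pattern also avoids recycling a length-4 pattern against a length-3 input. This is a sketch under the assumption that whole-word matches are wanted, hence the \\b word boundaries; the names pattern and found are illustrative.

```r
library(stringr)
library(tibble)

strings <- tibble(string = c("apple, orange, plum, tomato",
                             "plum, beat, pear, cactus",
                             "centipede, toothpick, pear, fruit"))
fruits <- tibble(fruit = c("apple", "orange", "plum", "pear"))

# one pattern for all fruits: \b(apple|orange|plum|pear)\b
pattern <- paste0("\\b(", paste(fruits$fruit, collapse = "|"), ")\\b")

# a list with one character vector of matches per input string
found <- str_extract_all(strings$string, pattern)

# collapse each row's matches into a single comma-separated string
strings$match <- vapply(found, paste, character(1), collapse = ", ")

strings
```

Unlike the split-and-compare approaches, this one does not depend on the exact ", " separator, but it treats the fruit names as regular expressions, so names containing regex metacharacters would need escaping.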

Here is a base-R solution:

strings[["match"]] <-
  sapply(
    strsplit(strings[["string"]], ", "),
    function(x) {
      paste(x[x %in% fruits[["fruit"]]], collapse = ", ")
    }
  )

Result:

string                            match              
<chr>                             <chr>              
1 apple, orange, plum, tomato       apple, orange, plum
2 plum, beat, pear, cactus          plum, pear         
3 centipede, toothpick, pear, fruit pear               
