R - How to identify and retrieve multiple patterns from multiple texts?



I hope this won't be flagged as a duplicate. I have seen similar Stack Overflow posts, but I could not get them to work for my case.

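My goals: 1st: I want to detect whether the values of the variable "Code" from auxiliary_df occur in main_df. 2nd: once they are detected, I want to create a column containing the codes that were identified. For example, for the row titled "School Performance", I would like the value "A1, A6, A7".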

main_df <- read.table(header = TRUE, 
stringsAsFactors = FALSE, 
text="Title Text
'School Performance' 'Students A1, A6 and A7 are great'
'Groceries Performance' 'Students A9, A3 are ok'
'Fruit Performance' 'A5 and A7 will be great fruit pickers'
'Jedi Performance' 'A3, A6, A5 will be great Jedis'
'Sith Performance' 'No one is very good. We should be happy.'")

auxiliary_df <- read.table(header = TRUE, 
stringsAsFactors = FALSE, 
text="FirstName Code
'Alex' 'A1'
'Figo' 'A6'
'Rui' 'A7'
'Deco' 'A5'
'Cristiano' 'A9'
'Ronaldo' 'A3'")

What I tried:

toMatch <- auxiliary_df$Code
matches <- grep(paste(toMatch, collapse = "|"), 
main_df$Title, value=TRUE)
matches #returns character(0)

I did not manage to identify any codes or move them into a new variable.

The desired output looks like this:

"School Performance" "Students A1, A6 and A7 are great" "A1, A6, A7"

Any help is welcome!

You are matching against main_df$Title instead of main_df$Text. You can use gregexpr together with regmatches to extract the hits (mostly reusing your code).

regmatches(main_df$Text, gregexpr(paste(auxiliary_df$Code, collapse = "|"),
main_df$Text))
#[[1]]
#[1] "A1" "A6" "A7"
#
#[[2]]
#[1] "A9" "A3"
#
#[[3]]
#[1] "A5" "A7"
#
#[[4]]
#[1] "A3" "A6" "A5"
#
#[[5]]
#character(0)
#
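
The list returned by regmatches can be collapsed into the comma-separated column the question asks for. The following is a minimal base R sketch on a reduced copy of the data; the column name Codes is only an illustrative choice and does not come from the original post.

```r
# Reduced copies of the two data frames from the question.
main_df <- data.frame(
  Title = c("School Performance", "Sith Performance"),
  Text  = c("Students A1, A6 and A7 are great",
            "No one is very good. We should be happy."),
  stringsAsFactors = FALSE
)
auxiliary_df <- data.frame(
  FirstName = c("Alex", "Figo", "Rui"),
  Code      = c("A1", "A6", "A7"),
  stringsAsFactors = FALSE
)

# One alternation pattern built from all codes: "A1|A6|A7".
pattern <- paste(auxiliary_df$Code, collapse = "|")

# List with one character vector of hits per row of main_df.
hits <- regmatches(main_df$Text, gregexpr(pattern, main_df$Text))

# toString() joins each vector with ", "; a row without hits becomes "".
# "Codes" is a hypothetical column name chosen for this sketch.
main_df$Codes <- vapply(hits, toString, character(1))
main_df$Codes
# [1] "A1, A6, A7" ""
```

vapply is used here instead of sapply so that the result is guaranteed to be a character vector with one element per row.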

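We can collapse all Code values into a single pattern, use str_extract_all to extract every code that occurs in Text, and combine the hits into one comma-separated string.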

main_df$extract_string <- sapply(stringr::str_extract_all(main_df$Text, 
paste0('\\b', auxiliary_df$Code, '\\b', collapse = '|')), toString)
main_df
#                  Title                                     Text extract_string
#1    School Performance         Students A1, A6 and A7 are great     A1, A6, A7
#2 Groceries Performance                   Students A9, A3 are ok         A9, A3
#3     Fruit Performance    A5 and A7 will be great fruit pickers         A5, A7
#4      Jedi Performance           A3, A6, A5 will be great Jedis     A3, A6, A5
#5      Sith Performance No one is very good. We should be happy.               

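Word boundaries (\b, written as '\\b' inside an R string) are added to the pattern so that A1 is not matched inside A11 or A110 when A1 itself does not occur in Text.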
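
The effect of the word boundaries can be checked on a small example. This is a base R sketch with made-up input (the text containing A11 is an assumption for illustration, not data from the question); perl = TRUE is used so that \b is handled by the PCRE engine.

```r
# Hypothetical input: "A11" is not one of the codes, "A6" is.
codes <- c("A1", "A6")
text  <- "Students A11 and A6 are great"

# Without boundaries: "A1|A6"
plain <- paste(codes, collapse = "|")
# With boundaries: "\\bA1\\b|\\bA6\\b"
bounded <- paste0("\\b", codes, "\\b", collapse = "|")

plain_hits   <- regmatches(text, gregexpr(plain, text, perl = TRUE))[[1]]
bounded_hits <- regmatches(text, gregexpr(bounded, text, perl = TRUE))[[1]]

plain_hits    # "A1" "A6"  (A1 is wrongly found inside A11)
bounded_hits  # "A6"       (A11 no longer produces a hit)
```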
