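I hope this won't be marked as a duplicate. I have seen similar Stack Overflow posts, but I couldn't get them to work for my case.
My goal: first, I want to detect whether the values of the variable "Code" from auxiliary_df occur in main_df. Second, once they are detected, I want to create a column containing the codes that were found. For example, for the row titled "School Performance" I would like to get "A1, A6, A7".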
main_df <- read.table(header = TRUE,
stringsAsFactors = FALSE,
text="Title Text
'School Performance' 'Students A1, A6 and A7 are great'
'Groceries Performance' 'Students A9, A3 are ok'
'Fruit Performance' 'A5 and A7 will be great fruit pickers'
'Jedi Performance' 'A3, A6, A5 will be great Jedis'
'Sith Performance' 'No one is very good. We should be happy.'")
auxiliary_df <- read.table(header = TRUE,
stringsAsFactors = FALSE,
text="FirstName Code
'Alex' 'A1'
'Figo' 'A6'
'Rui' 'A7'
'Deco' 'A5'
'Cristiano' 'A9'
'Ronaldo' 'A3'")
What I tried:
toMatch <- auxiliary_df$Code
matches <- grep(paste(toMatch, collapse = "|"),
main_df$Title, value=TRUE)
matches #returns character(0)
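I have not managed to identify any of the codes or move them into a new variable.
The desired output looks like this:
"School Performance" "Students A1, A6 and A7 are great" "A1, A6, A7"
Any help is welcome!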
You are matching against main_df$Title instead of main_df$Text. You can use gregexpr together with regmatches to extract the hits (this mostly reuses your own code).
regmatches(main_df$Text, gregexpr(paste(auxiliary_df$Code, collapse = "|"),
main_df$Text))
#[[1]]
#[1] "A1" "A6" "A7"
#
#[[2]]
#[1] "A9" "A3"
#
#[[3]]
#[1] "A5" "A7"
#
#[[4]]
#[1] "A3" "A6" "A5"
#
#[[5]]
#character(0)
#
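To get the single column the question asks for, the list returned by regmatches can be collapsed row by row. The following is a minimal base R sketch, not a definitive solution. It assumes data shaped like the question's main_df and auxiliary_df (shortened here so the block runs on its own), and the column name codes_found is only an illustrative choice.

```r
# Shortened copies of the question's data so the sketch is self-contained.
main_df <- data.frame(
  Title = c("School Performance", "Sith Performance"),
  Text  = c("Students A1, A6 and A7 are great",
            "No one is very good. We should be happy."),
  stringsAsFactors = FALSE
)
auxiliary_df <- data.frame(
  FirstName = c("Alex", "Figo", "Rui"),
  Code      = c("A1", "A6", "A7"),
  stringsAsFactors = FALSE
)

# Build one alternation pattern from all codes: "A1|A6|A7"
pattern <- paste(auxiliary_df$Code, collapse = "|")

# regmatches() returns one character vector of hits per element of Text
hits <- regmatches(main_df$Text, gregexpr(pattern, main_df$Text))

# toString() joins each vector with ", "; rows without hits become ""
# (codes_found is a hypothetical column name chosen for this sketch)
main_df$codes_found <- vapply(hits, toString, character(1))

main_df$codes_found
```

vapply is used instead of sapply so the result is guaranteed to be a character vector with one entry per row, even when no row contains a match.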
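We can collapse all the Code values into a single pattern, use str_extract_all to extract every code that occurs in Text, and combine the matches into one comma-separated string.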
main_df$extract_string <- sapply(stringr::str_extract_all(main_df$Text,
paste0('\\b', auxiliary_df$Code, '\\b', collapse = '|')), toString)
main_df
# Title Text extract_string
#1 School Performance Students A1, A6 and A7 are great A1, A6, A7
#2 Groceries Performance Students A9, A3 are ok A9, A3
#3 Fruit Performance A5 and A7 will be great fruit pickers A5, A7
#4 Jedi Performance A3, A6, A5 will be great Jedis A3, A6, A5
#5 Sith Performance No one is very good. We should be happy.
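Word boundaries (\b, written as '\\b' inside an R string) are added to the pattern so that, when A1 itself is not present in Text, the pattern does not match the A1 inside A11 or A110.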
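The effect of the word boundaries can be checked on a small example. This is a hedged base R sketch on made-up inputs: the vectors text and codes are hypothetical and not part of the question's data, and perl = TRUE is a choice made here to get Perl-style \b handling.

```r
# Hypothetical inputs: the first string contains A11 and A110 but no standalone A1.
text  <- c("A11 and A110 are new", "A1 and A11 are here")
codes <- c("A1", "A6")

# Pattern without boundaries: "A1|A6"
plain <- paste(codes, collapse = "|")

# Pattern with boundaries around every code: "\\bA1\\b|\\bA6\\b"
bounded <- paste0("\\b", codes, "\\b", collapse = "|")

plain_hits   <- regmatches(text, gregexpr(plain, text, perl = TRUE))
bounded_hits <- regmatches(text, gregexpr(bounded, text, perl = TRUE))

# Without boundaries, "A1" is picked up inside A11 and A110.
plain_hits[[1]]    # "A1" "A1"

# With boundaries, the first string has no hit and the second only the real A1.
bounded_hits[[1]]  # character(0)
bounded_hits[[2]]  # "A1"
```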