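I have two data frames. One (txt.df) has a column (text) containing the text I want to extract phrases from. The other (wrd.df) has a column (phrase) containing the phrases. Both are large data frames with complex text and strings, but let's say: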

txt.df <- data.frame(id = c(1, 2, 3, 4, 5),
text = c("they love cats and dogs", "he is drinking juice", 
"the child is having a nap on the bed", "they jump on the bed and break it",
"the cat is sleeping on the bed"))

wrd.df <- data.frame(label = c('a', 'b', 'c', 'd', 'e', 'd'),
phrase = c("love cats", "love dogs", "juice drinking", "nap on the bed", "break the bed",
"sleeping on the bed"))

What I ultimately need is txt.df with an additional column containing the labels of the detected phrases.

I tried creating a column in wrd.df in which the phrases are tokenized, like this:

wrd.df$token <- sapply(wrd.df$phrase, function(x) unlist(strsplit(x, split = " ")))

Then I tried to write a custom function that applies grepl/str_detect over the token column and returns the names (labels) of all matches that are TRUE:

Extract.Fun <- function(text, df, label, token){
for (i in token) {
truefalse[i] <- sapply(token[i], function (x) grepl(x, text))
truenames[i] <- names(which(truefalse[i] == T))
removedup[i] <- unique(truenames[i])
return(removedup)
}

Then I apply this custom function to txt.df$text to get a new column with the labels:

txt.df$extract <- sapply(txt.df$text, function (x) Extract.Fun(x, wrd.df, "label", "token"))

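But I'm not good at writing custom functions and I'm really stuck. I'd appreciate any help. P.S. It would be great if I could also get partial matches such as "drink juice" and "break the bed", but that's not a priority; the exact matches are fine.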

If you need to match the exact phrases, you can use regex_join() from the fuzzyjoin package.

fuzzyjoin::regex_join( txt.df, wrd.df, by = c(text = "phrase"), mode = "left" )
id                                 text label              phrase
1  1              they love cats and dogs     a           love cats
2  2                 he is drinking juice  <NA>                <NA>
3  3 the child is having a nap on the bed     d      nap on the bed
4  4    they jump on the bed and break it  <NA>                <NA>
5  5       the cat is sleeping on the bed     d sleeping on the bed
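
For readers without fuzzyjoin, the same exact-phrase lookup can be sketched in base R. This is only an illustration, not part of the answer above: exact_labels is a hypothetical helper name, phrases are matched literally with grepl(fixed = TRUE), and multiple hits for one text are collapsed into a single comma-separated string so that txt.df keeps one row per text.

```r
# Sample data copied from the question
txt.df <- data.frame(id = c(1, 2, 3, 4, 5),
                     text = c("they love cats and dogs", "he is drinking juice",
                              "the child is having a nap on the bed",
                              "they jump on the bed and break it",
                              "the cat is sleeping on the bed"),
                     stringsAsFactors = FALSE)

wrd.df <- data.frame(label = c('a', 'b', 'c', 'd', 'e', 'd'),
                     phrase = c("love cats", "love dogs", "juice drinking",
                                "nap on the bed", "break the bed",
                                "sleeping on the bed"),
                     stringsAsFactors = FALSE)

# Hypothetical helper: labels of all phrases that occur literally in one text
exact_labels <- function(text, phrases, labels) {
  # one TRUE/FALSE per phrase; fixed = TRUE treats the phrase as plain text
  hit <- vapply(phrases, function(p) grepl(p, text, fixed = TRUE), logical(1))
  found <- unique(labels[hit])
  # NA when nothing matched, otherwise the labels joined into one string
  if (length(found) == 0) NA_character_ else paste(found, collapse = ", ")
}

# Apply the helper to every text and store the result as a new column
txt.df$extract <- vapply(txt.df$text, exact_labels, character(1),
                         phrases = wrd.df$phrase, labels = wrd.df$label,
                         USE.NAMES = FALSE)

txt.df[, c("id", "extract")]
```

On the sample data this should label rows 1, 3 and 5 with a, d and d and leave rows 2 and 4 as NA, the same matches as the join output above.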

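If you want to match on all the words of a phrase regardless of their order, I think you can build a regex from the phrases that covers this behaviour...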

Update

#build regex for phrases
#done by splitting the phrases to individual words, and then paste the regex together
wrd.df$regex <- unlist( lapply( lapply( strsplit( wrd.df$phrase, " "), 
function(x) paste0( "(?=.*", x, ")", collapse = "" ) ),
function(x) paste0( "^", x, ".*$") ) )

fuzzyjoin::regex_join( txt.df, wrd.df, by = c(text = "regex"), mode = "left" )
id                                 text label              phrase                                        regex
1  1              they love cats and dogs     a           love cats                     ^(?=.*love)(?=.*cats).*$
2  1              they love cats and dogs     b           love dogs                     ^(?=.*love)(?=.*dogs).*$
3  2                 he is drinking juice     c      juice drinking                ^(?=.*juice)(?=.*drinking).*$
4  3 the child is having a nap on the bed     d      nap on the bed      ^(?=.*nap)(?=.*on)(?=.*the)(?=.*bed).*$
5  4    they jump on the bed and break it     e       break the bed            ^(?=.*break)(?=.*the)(?=.*bed).*$
6  5       the cat is sleeping on the bed     d sleeping on the bed ^(?=.*sleeping)(?=.*on)(?=.*the)(?=.*bed).*$
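
The lookahead patterns can also be checked without fuzzyjoin. The following is a minimal base-R sketch under a few assumptions: build_regex and all_word_labels are hypothetical names, the data is retyped from the question, and grepl needs perl = TRUE because lookaheads are not supported by R's default regex engine. Labels are collapsed to one string per text, which is closer to the single extra column the question asks for.

```r
# Data retyped from the question
texts <- c("they love cats and dogs", "he is drinking juice",
           "the child is having a nap on the bed",
           "they jump on the bed and break it",
           "the cat is sleeping on the bed")
phrases <- c("love cats", "love dogs", "juice drinking", "nap on the bed",
             "break the bed", "sleeping on the bed")
labels <- c('a', 'b', 'c', 'd', 'e', 'd')

# Hypothetical helper: one positive lookahead per word, so that every word
# must occur somewhere in the text, in any order
build_regex <- function(phrase) {
  words <- strsplit(phrase, " ", fixed = TRUE)[[1]]
  paste0("^", paste0("(?=.*", words, ")", collapse = ""), ".*$")
}
regexes <- vapply(phrases, build_regex, character(1), USE.NAMES = FALSE)

# For each text, collect the labels of all phrases whose words are all present
all_word_labels <- vapply(texts, function(txt) {
  hit <- vapply(regexes, function(rx) grepl(rx, txt, perl = TRUE), logical(1))
  found <- unique(labels[hit])
  if (length(found) == 0) NA_character_ else paste(found, collapse = ", ")
}, character(1), USE.NAMES = FALSE)

data.frame(text = texts, extract = all_word_labels)
```

Note that the words are matched as substrings, so "the" is also found inside "they". If whole-word matching is needed, wrapping each word in \\b word boundaries when building the pattern would be one stricter variant.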

r - How to extract all matching patterns (words in a string) in a data frame column?
