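I have a data frame and need to create a flag that marks the rows where there is a partial match between two columns. Here is the code with some dummy data: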
doc_id <- c("doc1","doc1","doc2","doc3","doc3","doc4","doc4")
word <- c("apple","apples","chicken","banana","bananas","veggie","veggies")
text <- c("yesterday I ate apples", "yesterday I ate apples", "yesterday I ate chicken", "yesterday I ate bananas", "yesterday I ate bananas", "yesterday I ate veggies", "yesterday I ate veggies")
mydata <- data.frame(doc_id,word,text,stringsAsFactors = FALSE)
The expected result is the same data frame with an additional column showing whether the match between word and text is a partial match:
doc_id <- c("doc1","doc1","doc2","doc3","doc3","doc4","doc4")
word <- c("apple","apples","chicken","banana","bananas","veggie","soup")
text <- c("yesterday I ate apples", "yesterday I ate apples", "yesterday I ate chicken", "yesterday I ate bananas", "yesterday I ate bananas", "yesterday I ate veggies", "yesterday I ate soup")
partial_match <- c("1","0","0","1","0","1","0")
mydata2 <- data.frame(doc_id,word,text,partial_match,stringsAsFactors = FALSE)
I have tried
str_detect(mydata$word, mydata$text)
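and similar things using functions such as charmatch, pmatch, grep and grepl, but without success.
The real data contains thousands of records, so the solution should scale.
Thanks.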
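After a lot of trial and error, I learned more about string manipulation and got it working. It may not be the most efficient approach, but it works.
Note: I marked some comments with "¹", "²" and "³" so that I can explain them afterwards.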
parcial.m = numeric() # Create an empty vector
for(i in seq_len(nrow(mydata2))){
  # ¹ three groups: leftover before the word, the word itself, leftover after it
  pattern = paste("([^\n]*)(",mydata2$word[i],")([^\n]*)",sep="")
  # Split the text on punctuation and spaces, i.e. into words
  split = unlist(strsplit(mydata2$text[i], "[ [:punct:]]"))
  # Keep only the 'original' word, i.e. the token that contains the searched word
  word = grep(mydata2$word[i], split, value=TRUE)
  if(length(grep(mydata2$word[i], word))==0) {
    parcial.m[i] = 0 # ² no match at all
  } else {
    # ³ TRUE (stored as 1) unless both leftovers are empty
    parcial.m[i] = !((gsub(pattern, "\\1", word)=="") & (gsub(pattern, "\\3", word)==""))
  }
}
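¹: The pattern is a group (marked with (...)) of zero or more (hence the *) characters that are anything except a newline (hence [^\n]: \n is the newline and ^ means everything but it), followed by a group containing the searched word, and a third group identical to the first.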
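As a small illustration of note ¹, the sketch below builds the same three-group pattern for a single word and pulls out groups 1 and 3. The variable names (target, before, after) are hypothetical and only used for this example, and it assumes the character class is meant to exclude the newline character.

```r
# Hypothetical single-word example; the pattern is built as in the loop
target  <- "apple"
pattern <- paste0("([^\n]*)(", target, ")([^\n]*)")

# Group 1 is whatever precedes the searched word, group 3 whatever follows it
before <- sub(pattern, "\\1", "apples")
after  <- sub(pattern, "\\3", "apples")

# For "apples" only the trailing "s" is left over, so this is a partial match
is_partial <- !(before == "" & after == "")
print(c(before = before, after = after))
print(is_partial)
```

When the token is exactly "apple", both leftovers are empty strings and is_partial comes out FALSE.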
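²: If there is no match at all, there is no partial match either, so we want the value 0. We pick out these cases using the fact that grep(mydata2$word[i], word) returns an integer vector of length 0 when nothing matches.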
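The behaviour that note ² relies on can be checked in isolation. This is only a minimal sketch with made-up tokens; the names tokens, no_hit and one_hit are hypothetical.

```r
# Made-up tokens standing in for one split text
tokens <- c("yesterday", "I", "ate", "soup")

no_hit  <- grep("veggie", tokens)  # nothing matches: an empty integer vector
one_hit <- grep("soup", tokens)    # index of the matching token

print(length(no_hit) == 0)
print(one_hit)
```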
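³: "\\1" and "\\3" select the 1st and 3rd of the groups described above. If the match is exact, nothing of word (which I called the 'original' word) is left over once we 'take away' the searched word (group 2), so groups 1 and 3 are both empty (i.e. == ""). This line tests whether both groups are empty at the same time (an exact match) and negates the result (hence the !). Since the if statement has already set the no-match cases to 0, what remains are the partial matches.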
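Since the question mentions thousands of records, here is a sketch of a different way to get the same flag without growing a vector inside a loop. It is not the method above: it swaps the regex groups for a fixed-string containment test per token. The helper name partial_match_flag is hypothetical, and it assumes that a row counts as a partial match only when the word occurs inside some token of the text but never as a whole token.

```r
# Dummy data from the question (expected-result version)
doc_id <- c("doc1","doc1","doc2","doc3","doc3","doc4","doc4")
word <- c("apple","apples","chicken","banana","bananas","veggie","soup")
text <- c("yesterday I ate apples", "yesterday I ate apples",
          "yesterday I ate chicken", "yesterday I ate bananas",
          "yesterday I ate bananas", "yesterday I ate veggies",
          "yesterday I ate soup")
mydata2 <- data.frame(doc_id, word, text, stringsAsFactors = FALSE)

# Hypothetical helper: 1 if the word is contained in a token without being
# equal to any token of the same text, 0 otherwise
partial_match_flag <- function(word, text) {
  tokens <- strsplit(text, "[ [:punct:]]+")  # one character vector per row
  mapply(function(w, tk) {
    hit <- grepl(w, tk, fixed = TRUE)        # tokens that contain the word
    as.integer(any(hit) && !any(tk == w))    # contained, but never exact
  }, word, tokens, USE.NAMES = FALSE)
}

mydata2$partial_match <- partial_match_flag(mydata2$word, mydata2$text)
print(mydata2)
```

Using fixed = TRUE treats the word as a literal string, so words containing regex metacharacters such as "." or "(" do not need escaping.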