Finding cosine similarity between documents and removing duplicates from an R data frame



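I am working with a data frame that contains only a document number and the text for each row. The data was exported from an XML file and is stored as a data frame in the variable text_df:

row  text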

1 when uploading objective file bugzilla se
2 spelling mistake docs section searching fo…
3 editparams cgi won save updates iis instal…
4 editparams cgi won save updates            
5 rfe unsubscribe from bug you reported      
6 unsubscribe from bug you reported  

I am using the following code to identify and remove the duplicates.

library(text2vec)
library(magrittr)  # provides %>%
doc_set_1 = text_df
it1 = itoken(doc_set_1$text, progressbar = FALSE)
# specially take different number of docs in second set
doc_set_2 = text_df
it2 = itoken(doc_set_2$text, progressbar = FALSE)
it = itoken(text_df$text, progressbar = FALSE)
v = create_vocabulary(it) %>%
  prune_vocabulary(doc_proportion_max = 0.1, term_count_min = 5)
vectorizer = vocab_vectorizer(v)
dtm1 = create_dtm(it1, vectorizer)
dtm2 = create_dtm(it2, vectorizer)
d1_d2_cos_sim = sim2(dtm1, dtm2, method = "cosine", norm = "l2")
mat<-(d1_d2_cos_sim)
mat[lower.tri(mat,diag=TRUE)] <- 0
## for converting a sparse matrix into dataframe
mdf<- as.data.frame(as.matrix(mat))
datalist = list()
for (i in 1:nrow(mat)) {
  t <- which(mat[i, ] > 0.8)
  if (length(t) > 1) {
    datalist[[i]] <- t # add it to your list
  }
}
#Number of Duplicates Found
length(unique(unlist(datalist)))
tmdf<- subset(mdf,select=-c(unique(unlist(datalist))))
# Removing the similar documents
text_df<-text_df[names(tmdf),]
nrow(text_df)

This code takes a long time to run. Any suggestions for improvement are welcome.

quanteda works well for this case. Here is an example:

library(tibble)
library(quanteda)
# In quanteda >= 3.0, textstat_simil() lives in the quanteda.textstats package.
library(quanteda.textstats)
df <- tibble(text = c("when uploading objective file bugzilla se",
                      "spelling mistake docs section searching fo",
                      "editparams cgi won save updates iis instal",
                      "editparams cgi won save updates",
                      "rfe unsubscribe from bug you reported",
                      "unsubscribe from bug you reported"))
# Tokenize first; recent quanteda versions no longer accept a character vector in dfm().
DocTerm <- dfm(tokens(df$text))
textstat_simil(DocTerm, margin = "documents", method = "cosine")
# Output as printed in the original answer (lower triangle only):
#           text1     text2     text3     text4     text5
# text2 0.0000000
# text3 0.0000000 0.0000000
# text4 0.0000000 0.0000000 0.8451543
# text5 0.0000000 0.0000000 0.0000000 0.0000000
# text6 0.0000000 0.0000000 0.0000000 0.0000000 0.9128709
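As a sanity check on the 0.8451543 entry for text3 and text4, the cosine similarity can be worked out by hand. This is a minimal base-R sketch: the helper name `cosine_sim` and the plain whitespace tokenization are assumptions made for illustration, not what quanteda does internally.

```r
# Hypothetical helper: cosine similarity between two term-count vectors.
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

text3 <- "editparams cgi won save updates iis instal"
text4 <- "editparams cgi won save updates"

# Split on single spaces and count terms over a shared vocabulary.
tok3 <- strsplit(text3, " ", fixed = TRUE)[[1]]
tok4 <- strsplit(text4, " ", fixed = TRUE)[[1]]
vocab <- union(tok3, tok4)
v3 <- as.numeric(table(factor(tok3, levels = vocab)))
v4 <- as.numeric(table(factor(tok4, levels = vocab)))

# 5 shared terms, vector lengths sqrt(7) and sqrt(5): 5 / sqrt(35)
sim_3_4 <- cosine_sim(v3, v4)
print(sim_3_4)
```

The two texts share five terms, so the value is 5 / sqrt(7 * 5), which matches the text3/text4 entry in the matrix.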

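If you want to subset the result by a particular threshold and see which documents are more similar than that value (0.9 in this case), you can do the following: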

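mycosinesim <- textstat_simil(DocTerm, margin = "documents", method = "cosine")
myMatcosine <- as.data.frame(as.matrix(mycosinesim))
# Row/column indices of every cell above 0.9 (this includes the diagonal)
higherthan90 <- as.data.frame(which(myMatcosine > 0.9, arr.ind = TRUE, useNames = TRUE))
# Drop the diagonal, where each document is compared with itself
higherthan90[which(higherthan90$row != higherthan90$col), ]
#         row col
# text6     6   5
# text5.1   5   6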

Now you can decide whether to remove text 5 or text 6, since they are very similar.
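One way to automate that decision is to keep the first document of every similar pair and drop the later one. The sketch below is only one possible policy, not part of quanteda: the helper name `keep_first_of_duplicates` is hypothetical, and the toy matrix just reuses the two non-zero similarities from the output shown earlier. It works on any square similarity matrix and avoids a row-by-row loop.

```r
# Hypothetical helper: indices of documents to keep, given a square
# similarity matrix `sim` and a cutoff `threshold`.
keep_first_of_duplicates <- function(sim, threshold = 0.9) {
  sim <- as.matrix(sim)
  # Keep only pairs (i, j) with i < j, so each pair is seen once and the
  # diagonal (self-similarity of 1) is ignored.
  sim[lower.tri(sim, diag = TRUE)] <- 0
  hits <- which(sim > threshold, arr.ind = TRUE)
  # The column index is the later document of each similar pair.
  drop <- unique(hits[, "col"])
  setdiff(seq_len(nrow(sim)), drop)
}

# Toy similarity matrix for six documents, using the values reported above.
sim <- diag(6)
sim[3, 4] <- sim[4, 3] <- 0.8451543
sim[5, 6] <- sim[6, 5] <- 0.9128709

keep_90 <- keep_first_of_duplicates(sim, threshold = 0.9)
keep_80 <- keep_first_of_duplicates(sim, threshold = 0.8)
print(keep_90)
print(keep_80)
```

With the objects from the answer, something like `df[keep_first_of_duplicates(as.matrix(mycosinesim)), ]` should give the de-duplicated data frame. Note that this policy drops a document whenever any earlier document is similar to it, even if that earlier document was itself dropped, which may or may not be what you want for chains of near-duplicates.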
