Using cosine similarity on a vector of strings to filter out similar strings



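I have a vector of strings. Some strings in the vector (possibly more than two) are similar to one another in terms of the words they contain. I want to filter out any string whose cosine similarity with any other string in the vector exceeds 30%. Of two strings being compared, I want to keep the one with more words. In other words, the result should contain only strings whose similarity to every other string of the original vector is below 30%. My goal is to filter out similar strings and keep only the roughly distinct ones.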

For example, the vector is:

x <- c("Dan is a good man and very smart", "A good man is rare", "Alex can be trusted with anything", "Dan likes to share his food", "Rare are man who can be trusted", "Please share food")

The result should be (assuming a similarity below 30%):

c("Dan is a good man and very smart", "Dan likes to share his food", "Rare are man who can be trusted")

The result above has not been verified.

The cosine code I am using:

library(tm)  # provides VectorSource, DocumentTermMatrix and weightTf
CSString_vector <- c("String One", "String Two")
corp <- tm::VCorpus(VectorSource(CSString_vector))
controlForMatrix <- list(removePunctuation = TRUE, wordLengths = c(1, Inf),
                         weighting = weightTf)
dtm <- DocumentTermMatrix(corp, control = controlForMatrix)
matrix_of_vector <- as.matrix(dtm)
res <- lsa::cosine(matrix_of_vector[1, ], matrix_of_vector[2, ])

I am working in RStudio.

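So, to rephrase what you want: you want to compute the pairwise similarity of all pairs of strings. You then want to use that similarity matrix to identify groups of strings that are similar enough to each other to form distinct groups. For each of those groups, you want to drop all but the longest string and return it. Did I get that right?

After some experimenting, this is the solution I came up with, step by step:

  • Compute the similarity matrix and binarize it using a threshold
  • Identify the distinct groups (cliques) using a graph algorithm from the igraph package
  • Find all strings in each group and keep the longest one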

Note: I had to adjust the threshold to 0.4 to make your example work.
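To get a feel for where a threshold falls, the cosine for a single pair can be worked out by hand. The sketch below uses base R only. `cosine_words` is a hypothetical helper, and its tokenization (lowercase, split on non-alphanumeric characters) is an assumption that only approximates what tm does.

```r
# Cosine similarity between two strings, based on raw word counts.
# Assumption: simple lowercase tokenization, not tm's exact pipeline.
cosine_words <- function(a, b) {
  tokenize <- function(s) {
    words <- strsplit(tolower(s), "[^[:alnum:]]+")[[1]]
    words[nzchar(words)]  # drop empty tokens
  }
  wa <- tokenize(a)
  wb <- tokenize(b)
  vocab <- union(wa, wb)
  # term-frequency vectors over the combined vocabulary
  va <- as.numeric(table(factor(wa, levels = vocab)))
  vb <- as.numeric(table(factor(wb, levels = vocab)))
  sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))
}

score <- cosine_words("A good man is rare", "Rare are man who can be trusted")
```

Under this tokenization the two strings share two words ("man" and "rare") out of five and seven distinct words, so the score is 2 / sqrt(35), roughly 0.34. That sits between 0.3 and 0.4, so this pair is linked at one threshold and not at the other.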


Similarity matrix

This is heavily based on the code you provided, but I wrapped it in a function and used the tidyverse to make the code (at least to my taste) more readable.

library(tm)
library(lsa)
library(tidyverse)
get_cos_sim <- function(corpus) {
  # pre-process corpus
  doc <- corpus %>%
    VectorSource() %>%
    tm::VCorpus()
  # get term frequency matrix
  tfm <- doc %>%
    DocumentTermMatrix(
      control = list(
        removePunctuation = TRUE,
        wordLengths = c(1, Inf),
        weighting = weightTf)) %>%
    as.matrix()
  # get row-wise similarity
  sim <- NULL
  for (i in seq_len(nrow(tfm))) {
    sim_i <- apply(
      X = tfm,
      MARGIN = 1,
      FUN = lsa::cosine,
      tfm[i, ])
    sim <- rbind(sim, sim_i)
  }
  # set identity diagonal to zero
  diag(sim) <- 0
  # label and return
  rownames(sim) <- corpus
  return(sim)
}
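The row-by-row loop can also be written as a single matrix product, since the cosine of two rows is their dot product divided by the product of their norms. This is a minimal base R sketch. It assumes a numeric term-frequency matrix with one row per string and no all-zero rows, and `cos_sim_matrix` is a hypothetical name that is not part of the code above.

```r
# Pairwise cosine similarity of the rows of a term-frequency matrix.
# Assumption: every row has at least one non-zero entry.
cos_sim_matrix <- function(tfm) {
  norms <- sqrt(rowSums(tfm^2))
  sim <- tcrossprod(tfm) / outer(norms, norms)
  diag(sim) <- 0  # ignore self-similarity, as in the loop version
  sim
}

# toy term-frequency matrix: 3 documents, 3 terms
toy_tfm <- rbind(c(1, 1, 0),
                 c(1, 0, 1),
                 c(0, 0, 2))
toy_sim <- cos_sim_matrix(toy_tfm)
```

For larger corpora this avoids growing a matrix with `rbind` inside a loop.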

Now we apply this function to your example data:

# example corpus
strings <- c(
  "Dan is a good man and very smart",
  "A good man is rare",
  "Alex can be trusted with anything",
  "Dan likes to share his food",
  "Rare are man who can be trusted",
  "Please share food")
# get pairwise similarities
sim <- get_cos_sim(strings)
# binarize (using a different threshold to make your example work)
sim <- sim > .4
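Before building a graph, it can help to list which pairs ended up linked by the threshold. This is a small base R sketch on a made-up logical matrix. `toy_bin` and `linked_pairs` are illustrative names and the matrix is not the one computed above.

```r
# toy binarized similarity matrix: strings 1~2 and 3~4 are "similar"
toy_bin <- matrix(FALSE, nrow = 4, ncol = 4)
toy_bin[1, 2] <- toy_bin[2, 1] <- TRUE
toy_bin[3, 4] <- toy_bin[4, 3] <- TRUE

# keep only the upper triangle so each pair is reported once
linked_pairs <- which(toy_bin & upper.tri(toy_bin), arr.ind = TRUE)
linked_pairs
```

Each row of `linked_pairs` holds the row and column index of one linked pair, which makes it easy to check whether a given threshold links the strings you expect.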

Identifying distinct groups

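This turned out to be an interesting problem! I found this paper, Chalermsook & Chuzhoy: Maximum Independent Set of Rectangles, which led me to this implementation in the igraph package. Basically, we treat similar strings as connected vertices in a graph and then look for distinct groups in the graph of the whole similarity matrix.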

library(igraph)
# create graph from adjacency matrix
# (tibble::as_tibble replaces the deprecated dplyr::as_data_frame)
cliques <- sim %>%
  tibble::as_tibble() %>%
  mutate(from = row_number()) %>%
  gather(key = 'to', value = 'edge', -from) %>%
  filter(edge) %>%
  graph_from_data_frame(directed = FALSE) %>%
  max_cliques()
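A graph built from an edge list only contains vertices that appear in at least one edge, so strings without any similar partner never enter the graph. One possible alternative is to build the graph from the adjacency matrix directly, which keeps isolated vertices. The sketch below makes that swap and also uses connected components rather than maximal cliques. The two give the same groups when every group is fully connected (pairs, for instance), but they differ for chains such as A~B, B~C without A~C. The toy matrix `adj` and the name `string_groups` are illustrative only.

```r
# toy symmetric adjacency matrix for 5 strings: 1~2 and 3~5 are similar,
# string 4 has no similar partner
ids <- as.character(1:5)
adj <- matrix(0, nrow = 5, ncol = 5, dimnames = list(ids, ids))
adj[1, 2] <- adj[2, 1] <- 1
adj[3, 5] <- adj[5, 3] <- 1

# isolated vertices are kept when building from an adjacency matrix
g <- igraph::graph_from_adjacency_matrix(adj, mode = "undirected")

# group string indices by connected component (not by maximal clique)
string_groups <- split(
  as.integer(igraph::V(g)$name),
  igraph::components(g)$membership)
```

Here `string_groups` has three entries, and string 4 forms a group of its own, so nothing has to be added back by hand.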

Finding the longest string

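Now we can use the list of cliques to retrieve the strings for the vertices of each clique and pick the longest string per clique. Caveat: strings that have no similar string in the corpus are missing from the graph, so I am adding them back manually. There may be a function in the igraph package that handles this better; I would be interested if anyone finds one.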

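# NOTE: find_longest() is called below but its definition is not shown here;
# this is an assumed version that returns the string with the most characters
find_longest <- function(indices, strings) {
  candidates <- strings[indices]
  candidates[which.max(nchar(candidates))]
}
# get the string indices per vertex clique first
string_cliques_index <- cliques %>%
  unlist %>%
  names %>%
  as.numeric
# find the indices that are distinct but not in a clique
# (i.e. unconnected vertices)
string_uniques_index <- colnames(sim)[!colnames(sim) %in% string_cliques_index] %>%
  as.numeric
# get a list with all indices
all_distinct <- cliques %>%
  lapply(names) %>%
  lapply(as.numeric) %>%
  c(string_uniques_index)
# get a list of distinct strings
lapply(all_distinct, find_longest, strings)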
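If the graph machinery feels heavy, a greedy pass over the numeric similarity matrix is a simpler alternative. This is not the clique approach used above and it can give different results for chains of similar strings. It visits strings from longest to shortest and keeps a string only if it is not too similar to anything already kept. `filter_similar` is a hypothetical helper, and measuring length with `nchar` is an assumption (a word count would work the same way).

```r
# Greedy filter: longer strings win over shorter, similar ones.
# `sim` is a numeric similarity matrix aligned with `strings`.
filter_similar <- function(strings, sim, threshold = 0.4) {
  visit_order <- order(nchar(strings), decreasing = TRUE)
  kept <- integer(0)
  for (i in visit_order) {
    # keep i only if it is not similar to any string kept so far
    if (!any(sim[i, kept] > threshold)) {
      kept <- c(kept, i)
    }
  }
  strings[sort(kept)]  # return in the original order
}

# toy data: the first two strings are similar, the third is unrelated
toy_strings <- c("aaa bbb ccc", "aaa bbb", "zzz")
toy_sim <- matrix(0, nrow = 3, ncol = 3)
toy_sim[1, 2] <- toy_sim[2, 1] <- 0.8
kept_strings <- filter_similar(toy_strings, toy_sim)
```

On the toy data this keeps "aaa bbb ccc" and "zzz" and drops the shorter "aaa bbb".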

Test case:

Let's test this with a longer vector of different strings:

strings <- c(
  "Dan is a good man and very smart",
  "A good man is rare",
  "Alex can be trusted with anything",
  "Dan likes to share his food",
  "Rare are man who can be trusted",
  "Please share food",
  "NASA is a government organisation",
  "The FBI organisation is part of the government of USA",
  "Hurricanes are a tragedy",
  "Mangoes are very tasty to eat ",
  "I like to eat tasty food",
  "The thief was caught by the FBI")

I get this binarized similarity matrix:

Dan is a good man and very smart                      FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
A good man is rare                                     TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Alex can be trusted with anything                     FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Dan likes to share his food                           FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Rare are man who can be trusted                       FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Please share food                                     FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
NASA is a government organisation                     FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
The FBI organisation is part of the government of USA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
Hurricanes are a tragedy                              FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Mangoes are very tasty to eat                         FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
I like to eat tasty food                              FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
The thief was caught by the FBI                       FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

Based on these similarities, the expected result would be:

# included
Dan is a good man and very smart
Alex can be trusted with anything
Dan likes to share his food
NASA is a government organisation
The FBI organisation is part of the government of USA
Hurricanes are a tragedy
Mangoes are very tasty to eat
# omitted
A good man is rare
Rare are man who can be trusted
Please share food
I like to eat tasty food
The thief was caught by the FBI

The actual output has the correct elements, but not in the original order. You can, however, reorder it using the original string vector.
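One way to do that reordering is to subset the original vector by membership in the result. This is a base R sketch on toy data, where `result` stands in for the list returned by the final `lapply` call and `original` for the input vector. Both names are illustrative.

```r
# toy stand-ins for the original vector and the unordered result list
original <- c("first string", "second string", "third string")
result <- list("third string", "first string")

# keep the elements of `original` that appear in `result`, in original order
ordered_result <- original[original %in% unlist(result)]
```

This relies on the returned strings being byte-for-byte identical to the originals, including any trailing whitespace.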

[[1]]
[1] "The FBI organisation is part of the government of USA"
[[2]]
[1] "Dan is a good man and very smart"
[[3]]
[1] "Alex can be trusted with anything"
[[4]]
[1] "Dan likes to share his food"
[[5]]
[1] "Mangoes are very tasty to eat "
[[6]]
[1] "NASA is a government organisation"
[[7]]
[1] "Hurricanes are a tragedy"

That's it! I hope this is what you were looking for, and that it may be useful to others.
