Using sklearn to find the string similarity between two texts given a large group of documents

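Given a large set of documents (book titles, for example), how can two book titles that are not in the original set be compared without recomputing the whole TF-IDF matrix?

For example,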

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
book_titles = ["The blue eagle has landed",
         "I will fly the eagle to the moon",
         "This is not how You should fly",
         "Fly me to the moon and let me sing among the stars",
         "How can I fly like an eagle",
         "Fixing cars and repairing stuff",
         "And a bottle of rum"]
vectorizer = TfidfVectorizer(stop_words='english', norm='l2', sublinear_tf=True)
tfidf_matrix = vectorizer.fit_transform(book_titles)

To check the similarity between the first and the second book titles, one would do

cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])

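and so on. Here TF-IDF is computed with respect to all the entries in the matrix, so each token's weight depends on how often that token appears across the whole corpus.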
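To make that dependence on the corpus concrete, the IDF values learned during fitting can be inspected. This is a small sketch that uses the `idf_` and `vocabulary_` attributes of the fitted `TfidfVectorizer`; the dictionary name `idf_by_token` is only illustrative and is not part of the question's code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

book_titles = ["The blue eagle has landed",
               "I will fly the eagle to the moon",
               "This is not how You should fly",
               "Fly me to the moon and let me sing among the stars",
               "How can I fly like an eagle",
               "Fixing cars and repairing stuff",
               "And a bottle of rum"]

vectorizer = TfidfVectorizer(stop_words='english', norm='l2', sublinear_tf=True)
vectorizer.fit(book_titles)

# Map each token to the IDF weight learned from the corpus
# (idf_by_token is an illustrative helper name).
idf_by_token = {token: vectorizer.idf_[col]
                for token, col in vectorizer.vocabulary_.items()}

# "fly" occurs in four titles, "rum" in only one, so "rum" gets the larger IDF.
print(idf_by_token["fly"], idf_by_token["rum"])
```

These learned IDF values are what make the weight of a token in any title depend on the rest of the corpus.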

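Now suppose two titles need to be compared, title1 and title2, which are not in the original set of book titles. The two titles can be added to the book_titles collection and compared afterwards, so that the word "rum", for example, is counted together with its occurrence in the previous corpus: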

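title1 = "The book of rum"
title2 = "Fly safely with a bottle of rum"
# add both new titles to the corpus
book_titles.extend([title1, title2])
# refit the vectorizer on the enlarged corpus
tfidf_matrix = vectorizer.fit_transform(book_titles)
# number of rows (documents) in the matrix
index = tfidf_matrix.shape[0]
# the two new titles are the last two rows
cosine_similarity(tfidf_matrix[index-2:index-1], tfidf_matrix[index-1:index])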

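This is really impractical and very slow if the set of documents grows very large or needs to be stored out of memory. What can be done in this case? If I compared only title1 and title2, the previous corpus would not be used.

Why do you append them to the list and recompute everything? Just do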

new_vectors = vectorizer.transform([title1, title2])
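A minimal sketch of how that call fits into the whole flow, assuming the vectorizer is fitted once on the original book_titles from the question. The variable name `similarity` is illustrative. Note that `transform` reuses the vocabulary and IDF weights learned during fitting, so words in the new titles that never appeared in the corpus (such as "book" here) are simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

book_titles = ["The blue eagle has landed",
               "I will fly the eagle to the moon",
               "This is not how You should fly",
               "Fly me to the moon and let me sing among the stars",
               "How can I fly like an eagle",
               "Fixing cars and repairing stuff",
               "And a bottle of rum"]

# Fit once on the original corpus; this learns the vocabulary and IDF weights.
vectorizer = TfidfVectorizer(stop_words='english', norm='l2', sublinear_tf=True)
vectorizer.fit(book_titles)

title1 = "The book of rum"
title2 = "Fly safely with a bottle of rum"

# Vectorize only the two new titles, using the already-learned weights.
new_vectors = vectorizer.transform([title1, title2])

# Cosine similarity between the two new titles (a single number).
similarity = cosine_similarity(new_vectors[0:1], new_vectors[1:2])[0, 0]
print(similarity)
```

The corpus itself is not touched again after fitting, so the cost of each comparison does not grow with the number of stored titles.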
