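I am working on a similarity ranking task: I want to compute the distance between each document in a corpus and a cluster. The cluster is itself given as a list of documents. My problem is that I cannot find a proper way to compute the cluster's centroid so that I can then compute the similarity. I tried using the mean of the cluster's tfidf matrix, but it gives poor results.
For example, my cluster is: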
['Line a baking pan with a sheet of parchment paper.',
'Line the cake pan with parchment paper.',
'Line the bottom with parchment paper.',
'Line a baking pan with parchment paper.'
]
My corpus contains the following 3 documents:
['Add vinegar and sugar.',
'Remove pan from heat and let stand 5 minutes.',
'Line the pan with parchment paper.'
]
I want to compute the similarity between each document and the cluster, which might give a result like this:
[0.1, 0.1, 0.8]
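Do you have any suggestions? I tried representing both the cluster and the corpus documents as tfidf matrices, but computing the similarity between the two matrices does not seem to give the desired result. I also tried LSI, but what I want to rank is the corpus, not the cluster documents, which forces me to find a centroid that represents the cluster.
This is my current code: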
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
cluster = ['Line a baking pan with a sheet of parchment paper.',
'Line the cake pan with parchment paper.',
'Line the bottom with parchment paper.',
'Line a baking pan with parchment paper.']
corpus = ['Add vinegar and sugar.',
'Remove pan from heat and let stand 5 minutes.',
'Line the pan with parchment paper.']
# Fit tfidf on the cluster (the vocabulary comes from the cluster only)
tfidf = TfidfVectorizer()
tfidf_cluster = tfidf.fit_transform(cluster)
# Transform the corpus using the fitted tfidf
tfidf_corpus = tfidf.transform(corpus)
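# Cosine similarity: TfidfVectorizer L2-normalizes rows by default,
# so the dot product of two rows is their cosine similarity.
# (.toarray() instead of .A, which newer SciPy versions no longer provide.)
cos_similarity = (tfidf_corpus @ tfidf_cluster.T).toarray()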
avg_similarity = np.mean(cos_similarity, axis=1)
cos_similarity
Out[271]:
array([[0. , 0. , 0. , 0. ],
[0.31452723, 0.36145869, 0. , 0.43855558],
[0.50673521, 0.8242027 , 0.7139548 , 0.70655744]])
avg_similarity
Out[272]: array([0. , 0.27863537, 0.68786254])
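For comparison, here is a minimal sketch of the centroid variant I described above: take the mean of the cluster's tfidf rows as a single centroid vector, then compute the cosine similarity between each corpus document and that centroid. The names `centroid` and `centroid_similarity` are my own, and this is only one way of defining a centroid, not necessarily the right one.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

cluster = ['Line a baking pan with a sheet of parchment paper.',
           'Line the cake pan with parchment paper.',
           'Line the bottom with parchment paper.',
           'Line a baking pan with parchment paper.']
corpus = ['Add vinegar and sugar.',
          'Remove pan from heat and let stand 5 minutes.',
          'Line the pan with parchment paper.']

# Fit tfidf on the cluster, then project the corpus into the same space
tfidf = TfidfVectorizer()
tfidf_cluster = tfidf.fit_transform(cluster)
tfidf_corpus = tfidf.transform(corpus)

# Centroid = mean of the cluster's tfidf rows, kept as a (1, n_features) array
centroid = np.asarray(tfidf_cluster.mean(axis=0)).reshape(1, -1)

# Cosine similarity of each corpus document to the centroid, shape (3,)
centroid_similarity = cosine_similarity(tfidf_corpus, centroid).ravel()
print(centroid_similarity)
```

Because the corpus rows are already L2-normalized, this score is just the avg_similarity above divided by the norm of the centroid, so it rescales the values but produces the same ranking of the corpus documents.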