After preprocessing and transforming the data (BOW, TF-IDF), I need to compute its cosine similarity with every other element of the dataset. Currently, I do this:
cs_title = [cosine_similarity(a, b) for a in tr_title for b in tr_title]
cs_abstract = [cosine_similarity(a, b) for a in tr_abstract for b in tr_abstract]
cs_mesh = [cosine_similarity(a, b) for a in pre_mesh for b in pre_mesh]
cs_pt = [cosine_similarity(a, b) for a in pre_pt for b in pre_pt]
In this example, each input variable (e.g. tr_title) is a SciPy sparse matrix. However, this code runs very slowly. What can I do to optimize the code so that it runs faster?
To improve performance, you should replace the list comprehensions with vectorized code. This can be done easily through SciPy's pdist and squareform, as shown in the snippet below:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import pdist, squareform
titles = [
'A New Hope',
'The Empire Strikes Back',
'Return of the Jedi',
'The Phantom Menace',
'Attack of the Clones',
'Revenge of the Sith',
'The Force Awakens',
'A Star Wars Story',
'The Last Jedi',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(titles)
cs_title = squareform(pdist(X.toarray(), 'cosine'))
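If memory is a concern, note that scikit-learn's own cosine_similarity accepts the sparse matrix directly, so the dense conversion via toarray() can be avoided altogether. A minimal sketch (using a shortened, made-up titles list):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = ['A New Hope', 'The Empire Strikes Back', 'Return of the Jedi']
X = CountVectorizer().fit_transform(titles)  # stays sparse (CSR)

# Pairwise cosine *similarity* for all rows, computed in one vectorized call
sim = cosine_similarity(X)  # dense (3, 3) ndarray
```

Unlike pdist, this returns similarities (1 on the diagonal) rather than distances, and it never materializes the dense document-term matrix.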
Demo:
In [87]: X
Out[87]:
<9x21 sparse matrix of type '<type 'numpy.int64'>'
with 30 stored elements in Compressed Sparse Row format>
In [88]: X.toarray()
Out[88]:
array([[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
[1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0],
[0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]], dtype=int64)
In [89]: vectorizer.get_feature_names()
Out[89]:
[u'attack',
u'awakens',
u'back',
u'clones',
u'empire',
u'force',
u'hope',
u'jedi',
u'last',
u'menace',
u'new',
u'of',
u'phantom',
u'return',
u'revenge',
u'sith',
u'star',
u'story',
u'strikes',
u'the',
u'wars']
In [90]: np.set_printoptions(precision=2)
In [91]: print(cs_title)
[[ 0. 1. 1. 1. 1. 1. 1. 1. 1. ]
[ 1. 0. 0.75 0.71 0.75 0.75 0.71 1. 0.71]
[ 1. 0.75 0. 0.71 0.5 0.5 0.71 1. 0.42]
[ 1. 0.71 0.71 0. 0.71 0.71 0.67 1. 0.67]
[ 1. 0.75 0.5 0.71 0. 0.5 0.71 1. 0.71]
[ 1. 0.75 0.5 0.71 0.5 0. 0.71 1. 0.71]
[ 1. 0.71 0.71 0.67 0.71 0.71 0. 1. 0.67]
[ 1. 1. 1. 1. 1. 1. 1. 0. 1. ]
[ 1. 0.71 0.42 0.67 0.71 0.71 0.67 1. 0. ]]
Note that X.toarray().shape yields (9L, 21L) because in the toy example above there are 9 titles and 21 distinct words, whereas cs_title is a 9 x 9 array. Also keep in mind that pdist with the 'cosine' metric computes the cosine *distance* (which is why the diagonal of the matrix above is 0, not 1); if you need similarities, subtract the result from 1.
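Since pdist's 'cosine' metric returns a distance rather than a similarity, converting is a single subtraction. A small sketch with made-up 2-D vectors:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])

dist = squareform(pdist(X, 'cosine'))  # cosine *distance*, 0 on the diagonal
sim = 1.0 - dist                       # cosine *similarity*, 1 on the diagonal
```

Here sim[0, 1] is 1/sqrt(2) (a 45-degree angle between the vectors) and sim[0, 2] is 0 (orthogonal vectors).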
You can cut the work of each computation by more than half by exploiting two properties of the cosine similarity of two vectors:

- The cosine similarity of a vector with itself is 1.
- The cosine similarity of vector x with vector y is the same as the cosine similarity of vector y with vector x.

Therefore, compute only the elements above (or below) the diagonal.
Edit: here is how you could compute it. Note in particular that cs is just a dummy function standing in for the actual computation of the similarity coefficient.
import numpy as np

title1 = 'A four word title'
title2 = 'A five word title'
title3 = 'A six word title'
title4 = 'A seven word title'
titles = [title1, title2, title3, title4]
N = len(titles)

similarity_matrix = np.zeros((N, N), dtype=float)
cs = lambda a, b: 10 * a + b  # just a 'pretend' calculation of the coefficient

for m in range(N):
    similarity_matrix[m, m] = 1
    for n in range(m + 1, N):
        similarity_matrix[m, n] = cs(m, n)
        similarity_matrix[n, m] = similarity_matrix[m, n]

print(similarity_matrix)
This is the result:
[[ 1. 1. 2. 3.]
[ 1. 1. 12. 13.]
[ 2. 12. 1. 23.]
[ 3. 13. 23. 1.]]
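For comparison, pdist itself already exploits this symmetry internally: it returns a condensed vector holding only the N*(N-1)/2 upper-triangle entries, which squareform then expands into the full symmetric matrix. A small sketch (using orthogonal unit vectors, so every pairwise cosine distance is 1):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.eye(4)                  # 4 mutually orthogonal unit vectors
condensed = pdist(X, 'cosine') # only the N*(N-1)//2 == 6 unique pairs
full = squareform(condensed)   # symmetric 4 x 4 matrix, 0 on the diagonal
```

So the hand-rolled double loop above is mainly useful when your coefficient is not one of pdist's built-in metrics; otherwise the vectorized call already does only the necessary work.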