I get the following error when calling cosine_similarity:
numerator = sum(a*b for a,b in zip(x,y))
TypeError: only integer arrays with one element can be converted to an index
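I am trying to build a term-term co-occurrence matrix from the document-term matrix returned by CountVectorizer.
I suspect cosine_similarity does not like the data type I am passing, but I am not sure exactly what the problem is. Here, n is of type scipy.sparse.csc.csc_matrix and y is of type scipy.sparse.csr.csr_matrix.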
documents = (
"The sky is blue",
"The sun is bright",
"The sun in the sky is bright",
"We can see the shining sun, the bright sun"
)
countvectorizer = CountVectorizer()
y = countvectorizer.fit_transform(documents)
n = y.T.dot(y)
x = n.tocsr()
x = x.toarray()
numpy.fill_diagonal(x, 0)
result = cosine_similarity(x, "None")
With sklearn's cosine_similarity, this snippet runs and returns an answer that looks reasonable.
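import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
# Import cosine_similarity directly: distance_metrics()['cosine'] maps to
# cosine_distances (1 - similarity), not to the similarity itself.
from sklearn.metrics.pairwise import cosine_similarity

documents = (
"The sky is blue",
"The sun is bright",
"The sun in the sky is bright",
"We can see the shining sun, the bright sun"
)
countvectorizer = CountVectorizer()
# Sparse document-term count matrix
y = countvectorizer.fit_transform(documents)
# Term-term co-occurrence matrix
n = y.T.dot(y)
x = n.tocsr()
x = x.toarray()
# Ignore each term's co-occurrence with itself
np.fill_diagonal(x, 0)
# Pass the matrix for both arguments (not the string "None")
result = cosine_similarity(x, x)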
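
As a side note, the `numerator = sum(a*b for a,b in zip(x,y))` line in the traceback looks like it comes from a pure-Python cosine function that expects two 1-D vectors rather than a 2-D matrix. That is only a guess from the traceback, since the function's source is not shown. The sketch below illustrates the idea under that assumption: a 1-D function of that shape, applied to every pair of rows. The names `cosine_similarity_1d` and `pairwise_cosine` and the small matrix are hypothetical, made up for illustration.

```python
import math


def cosine_similarity_1d(x, y):
    """Cosine similarity of two equal-length 1-D sequences of numbers."""
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    # Treat an all-zero vector as having zero similarity to everything.
    return numerator / denominator if denominator else 0.0


def pairwise_cosine(rows):
    """Apply the 1-D function to every pair of rows of a 2-D list."""
    return [[cosine_similarity_1d(r, s) for s in rows] for r in rows]


# Hypothetical 3x3 co-occurrence matrix with the diagonal already zeroed.
cooc = [
    [0, 2, 1],
    [2, 0, 1],
    [1, 1, 0],
]
result = pairwise_cosine(cooc)
```

For anything beyond toy sizes, the vectorized sklearn call above is likely the better choice; the loop here is only meant to show what a per-pair function would need as input.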