How do I assign a new observation to a cluster using a distance matrix and KMedoids?

I have a DataFrame that holds the Word Mover's Distance between every pair of documents. I am running KMedoids on it to generate clusters.

      1      2      3      4      5
1  0.00   0.05   0.07   0.04   0.05
2  0.05   0.00   0.06   0.04   0.05
3  0.07   0.06   0.00   0.06   0.06
4  0.04   0.04   0.06   0.00   0.04
5  0.05   0.05   0.06   0.04   0.00

kmed = KMedoids(n_clusters=3, random_state=123, method='pam').fit(distance)

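After fitting on this initial matrix and generating the clusters, I want to add new points to be clustered. After adding a new document to the distance matrix, I end up with: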

      1      2      3      4      5      6
1  0.00   0.05   0.07   0.04   0.05   0.12
2  0.05   0.00   0.06   0.04   0.05   0.21
3  0.07   0.06   0.00   0.06   0.06   0.01
4  0.04   0.04   0.06   0.00   0.04   0.05
5  0.05   0.05   0.06   0.04   0.00   0.12
6  0.12   0.21   0.01   0.05   0.12   0.00

I tried calling kmed.predict on the new row:

kmed.predict(new_distance.loc[-1: ])

However, this fails with an incompatible-dimension error: X.shape[1] == 6 while Y.shape[1] == 5.

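How can I use the new document's distances to determine which cluster it should belong to? Is this even possible, or do I have to recompute the clusters every time? Thanks!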

The k-medoids source code says the following:

def transform(self, X):
    """Transforms X to cluster-distance space.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape (n_query, n_features),
        or (n_query, n_indexed) if metric == 'precomputed'
        Data to transform.
    """

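I assume you are using the precomputed metric (since you compute the distances outside the estimator), so in your example n_query is the number of new documents and n_indexed is the number of documents the fit method was called on.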

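In your particular case, where the model is fitted on 5 documents and you then want to classify a 6th, the X passed to predict should have shape (1, 5), which can be obtained as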

kmed.predict(new_distance.iloc[-1:, :-1])
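
To make the shapes concrete, here is a minimal NumPy sketch under the precomputed-metric assumption above. It takes the 6x6 matrix from the question, keeps only the new document's distances to the 5 fitted documents, and assigns the document to the nearest medoid. The medoid_indices values are made up for illustration; with a fitted model they would presumably come from kmed.medoid_indices_.

```python
import numpy as np

# 6x6 distance matrix from the question (document 6 is the new one)
new_distance = np.array([
    [0.00, 0.05, 0.07, 0.04, 0.05, 0.12],
    [0.05, 0.00, 0.06, 0.04, 0.05, 0.21],
    [0.07, 0.06, 0.00, 0.06, 0.06, 0.01],
    [0.04, 0.04, 0.06, 0.00, 0.04, 0.05],
    [0.05, 0.05, 0.06, 0.04, 0.00, 0.12],
    [0.12, 0.21, 0.01, 0.05, 0.12, 0.00],
])

# Keep the last row and drop its last column (the new document's self-distance)
query = new_distance[-1:, :-1]
print(query.shape)  # (1, 5): n_query = 1, n_indexed = 5

# ASSUMPTION: hypothetical 0-based medoid positions, for illustration only
medoid_indices = np.array([0, 2, 3])

# Nearest-medoid rule: pick the medoid column with the smallest distance
labels = np.argmin(query[:, medoid_indices], axis=1)
nearest_doc = medoid_indices[labels] + 1  # back to the 1-based document numbers
print(labels, nearest_doc)
```

With these made-up medoids the new document lands in the cluster whose medoid is document 3, which matches its very small distance (0.01) to that document.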

Here is my attempt. Each time, we have to recompute the distances between the new point and the old points.

import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances
from sklearn_extra.cluster import KMedoids

# dummy data for the trial: two points, (0, 1) and (1, 2)
df = pd.DataFrame({0: [0, 1], 1: [1, 2]})
# calculate the pairwise distances between the initial points
distance = pairwise_distances(df.values, df.values)
# fit the model
kmed = KMedoids(n_clusters=2, random_state=123, method='pam').fit(distance)

new_point = [2, 3]
# calculate the distances between the new point and the initial dataset
new_distance = pairwise_distances(np.array(new_point).reshape(1, -1), df.values)
print(new_distance)
# the row holds one distance per fitted point (no self-distance), so its shape is (1, 2)
print(kmed.predict(new_distance[0][:2].reshape(1, -1)))
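
One consequence worth noting: if membership is decided by the nearest medoid, only the distances from the new point to the medoids affect the result, although kmed.predict still expects the full row of distances to every fitted point. The sketch below applies the nearest-medoid rule directly with NumPy. assign_to_nearest_medoid is a hypothetical helper, not part of any library, and Euclidean distance stands in for whatever metric (such as Word Mover's Distance) was used to build the matrix.

```python
import numpy as np

def assign_to_nearest_medoid(new_points, medoid_points):
    """Return, for each new point, the position of its nearest medoid (Euclidean)."""
    new_points = np.atleast_2d(np.asarray(new_points, dtype=float))
    medoid_points = np.atleast_2d(np.asarray(medoid_points, dtype=float))
    # Pairwise differences, shape (n_new, n_medoids, n_features)
    diffs = new_points[:, None, :] - medoid_points[None, :, :]
    distances = np.linalg.norm(diffs, axis=2)
    return np.argmin(distances, axis=1)

# ASSUMPTION: the two dummy points from the example above both act as medoids,
# which is what two clusters over two points should produce
medoids = [[0, 1], [1, 2]]
result = assign_to_nearest_medoid([[2, 3]], medoids)
print(result)  # [1]
```

The returned value is a position in the medoids list passed in, so it only matches the labels from kmed.predict if the medoids are listed in the same order the fitted model uses.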
