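After clustering a distance matrix with scipy.cluster.hierarchy.linkage and assigning each sample to a cluster with scipy.cluster.hierarchy.cut_tree, I would like to extract from each cluster the one element that is closest to that cluster's centroid.
- I would be happiest if an off-the-shelf function existed for this, but failing that: some suggestions have already been made elsewhere for extracting the centroids themselves, but not the elements closest to them.
- This is not to be confused with the centroid linkage rule in scipy.cluster.hierarchy.linkage. I have already carried out the clustering itself and just want to access the closest-to-centroid elements.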
Nearest neighbours are computed most efficiently with a k-d tree. For example:
import numpy as np
from scipy.spatial import cKDTree

def find_k_closest(centroids, data, k=1, distance_norm=2, leafsize=16):
    """
    Arguments:
    ----------
        centroids: (M, d) ndarray
            M - number of clusters
            d - number of data dimensions
        data: (N, d) ndarray
            N - number of data points
        k: int (default 1)
            nearest neighbour to get
        distance_norm: int (default 2)
            1: Manhattan distance (|x| + |y|)
            2: Euclidean distance (sqrt(x^2 + y^2))
            np.inf: maximum distance in any dimension (max(|x|, |y|))
        leafsize: int (default 16, the cKDTree default)
            number of points at which the tree switches to brute force
    Returns:
    -------
        indices: (M,) ndarray
        values: (M, d) ndarray
    """
    kdtree = cKDTree(data, leafsize=leafsize)
    distances, indices = kdtree.query(centroids, k, p=distance_norm)
    if k > 1:
        # query returns the k nearest neighbours; keep only the k-th
        indices = indices[:, -1]
    values = data[indices]
    return indices, values

# centroids and data are the (M, d) and (N, d) arrays described above
indices, values = find_k_closest(centroids, data)
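One caveat with querying a tree built from all the data is that the point nearest to a centroid may belong to a different cluster. The following is a minimal sketch, not part of the original answer, that computes each centroid as the mean of its members and searches only within that cluster. It assumes the raw coordinates are available; the function name `closest_to_centroid` and the toy data are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree
from scipy.spatial import cKDTree
from scipy.spatial.distance import pdist


def closest_to_centroid(data, labels):
    """Map each cluster label to the row index of the member nearest the cluster mean."""
    closest = {}
    for label in np.unique(labels):
        members = np.flatnonzero(labels == label)  # row indices of this cluster
        centroid = data[members].mean(axis=0)      # arithmetic mean of the members
        # Build the tree from this cluster only, so the hit is always a member.
        _, nearest = cKDTree(data[members]).query(centroid, k=1)
        closest[int(label)] = int(members[nearest])
    return closest


# Toy data (illustrative only): two well-separated groups in 2D.
data = np.array([[0.0, 0.0], [1.0, 0.0], [0.4, 0.1],
                 [10.0, 10.0], [11.0, 10.0], [10.6, 10.1]])
Z = linkage(pdist(data), method="average")
labels = cut_tree(Z, n_clusters=2).ravel()  # cut_tree returns shape (n, 1)
representatives = closest_to_centroid(data, labels)
```

Building one small tree per cluster is wasteful for many tiny clusters; a plain `np.linalg.norm` over the members would do just as well there.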
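Paul's solution above works well for multidimensional arrays. In the more specific case where you have a distance matrix dm in which the distances are calculated in a "non-trivial" way (e.g. each pair of objects is first aligned in 3D, then the RMSD is calculated), I ended up selecting from each cluster the element whose sum of distances to the other elements of the cluster is smallest, i.e. the medoid of the cluster. (See the discussion below.) This is how I did it, given the condensed distance matrix dm and the list of object names in the same order, names: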
import numpy as np
import scipy.spatial.distance as spd
import scipy.cluster.hierarchy as sch

# dm (condensed distance matrix), names, linkage (the linkage method name)
# and rmsd_cutoff are assumed to be defined beforehand.

# Square form of distance matrix
sq = spd.squareform(dm)

# Perform clustering, capture linkage object
clusters = sch.linkage(dm, method=linkage)

# Cluster assignments; cut_tree returns an (n, 1) array, so flatten it
assignments = sch.cut_tree(clusters, height=rmsd_cutoff).ravel()

# Store object names and assignments as a list of tuples
nameList = list(zip(names, assignments))

### Extract models closest to cluster centroids
centroids = []
for counter in np.unique(assignments):
    # Boolean mask for extracting the submatrix of this cluster
    mask = assignments == counter

    # Take the index of the column with the smallest sum of distances from the submatrix
    idx = np.argmin(sq[:, mask][mask, :].sum(axis=0))

    # Extract names of cluster elements from nameList
    sublist = [name for (name, cluster) in nameList if cluster == counter]

    # Element closest to centroid (the medoid of this cluster)
    centroids.append(sublist[idx])
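To try the medoid selection without an alignment pipeline, here is a small self-contained sketch of the same idea that works directly on a square distance matrix and returns indices rather than names. The helper name `cluster_medoids`, the 5x5 matrix and the labels are invented for illustration.

```python
import numpy as np


def cluster_medoids(sq, labels):
    """Map each cluster label to the index of the member whose summed
    distance to the members of its own cluster is smallest."""
    labels = np.asarray(labels).ravel()
    medoids = {}
    for label in np.unique(labels):
        members = np.flatnonzero(labels == label)
        sub = sq[np.ix_(members, members)]  # within-cluster distances only
        medoids[int(label)] = int(members[np.argmin(sub.sum(axis=0))])
    return medoids


# Made-up symmetric distance matrix: items 0-2 form one cluster, 3-4 another.
sq = np.array([[0.0, 1.0, 2.0, 9.0, 9.0],
               [1.0, 0.0, 1.0, 9.0, 9.0],
               [2.0, 1.0, 0.0, 9.0, 9.0],
               [9.0, 9.0, 9.0, 0.0, 3.0],
               [9.0, 9.0, 9.0, 3.0, 0.0]])
labels = [0, 0, 0, 1, 1]
medoids = cluster_medoids(sq, labels)
```

In the first cluster item 1 has the smallest column sum (2.0 against 3.0 for items 0 and 2). In the second cluster both items tie, and `np.argmin` returns the first of them.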