Finding the element closest to a cluster's centroid



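After clustering a distance matrix with scipy.cluster.hierarchy.linkage and assigning each sample to a cluster with scipy.cluster.hierarchy.cut_tree, I would like to extract from each cluster the one element that is closest to that cluster's centroid.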

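I would be happiest if an off-the-shelf function for this already exists, but in the absence of one:

  • Some suggestions have already been made here for extracting the centroids themselves, but not the elements closest to them.
  • This is not to be confused with the centroid linkage rule in scipy.cluster.hierarchy.linkage. I have already done the clustering itself and only want to access the elements nearest to the centroids.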

Nearest neighbours are computed most efficiently with a KD-tree. For example:

from scipy.spatial import cKDTree
def find_k_closest(centroids, data, k=1, distance_norm=2):
    """
    Arguments:
    ----------
        centroids: (M, d) ndarray
            M - number of clusters
            d - number of data dimensions
        data: (N, d) ndarray
            N - number of data points
        k: int (default 1)
            nearest neighbour to get
        distance_norm: int (default 2)
            1: Manhattan distance (|x| + |y|)
            2: Euclidean distance (sqrt(x^2 + y^2))
            np.inf: maximum distance in any dimension (max(|x|, |y|))
    Returns:
    -------
        indices: (M,) ndarray
        values: (M, d) ndarray
    """
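    # Build the tree with cKDTree's default leaf size
    kdtree = cKDTree(data)
    distances, indices = kdtree.query(centroids, k, p=distance_norm)
    if k > 1:
        # query returns (M, k) indices; keep only the k-th nearest neighbour
        indices = indices[:,-1]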
    values = data[indices]
    return indices, values
indices, values = find_k_closest(centroids, data)
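
The call above assumes that centroids has already been computed. As a minimal sketch, assuming the raw coordinates are available as an (N, d) array, each centroid can be taken as the mean of its cluster's points and the KD-tree query restricted to that cluster, so the returned element is always a member of it. The toy data, the "average" linkage method, and the helper name closest_to_centroids are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree
from scipy.spatial import cKDTree


def closest_to_centroids(data, labels):
    """Return {label: index into data} of the member nearest each cluster mean.

    Hypothetical helper, written for illustration only.
    """
    result = {}
    for label in np.unique(labels):
        members = np.flatnonzero(labels == label)   # indices of this cluster's points
        centroid = data[members].mean(axis=0)       # cluster mean
        # Query only this cluster's points so the hit is always a member
        _, nearest = cKDTree(data[members]).query(centroid, k=1)
        result[int(label)] = int(members[nearest])
    return result


# Toy data: two well-separated groups of 2-D points (made up for this example)
data = np.array([[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0], [1.0, 1.1],
                 [10.0, 10.0], [10.0, 12.0], [12.0, 10.0], [12.0, 12.0], [11.0, 10.9]])
Z = linkage(data, method="average")
labels = cut_tree(Z, n_clusters=2).ravel()   # cut_tree returns shape (N, 1)
representatives = closest_to_centroids(data, labels)
```

Building one small tree per cluster costs a little more than a single global tree, but a global query can return a point from a neighbouring cluster when clusters are not convex.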

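Paul's solution above works well for multi-dimensional arrays. In a more specific case, where you have a distance matrix dm in which the distances are computed in a "non-trivial" way (for example, each pair of objects is first aligned in 3D and then the RMSD is calculated), I ended up selecting from each cluster the element whose sum of distances to the other elements of the cluster is smallest, i.e. the medoid of the cluster. (See the discussion below.) This is what I did, given the distance matrix dm and the list of object names in the same order, names: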
import numpy as np
import scipy.spatial.distance as spd
import scipy.cluster.hierarchy as sch
# dm is the condensed distance matrix; linkage (a method name) and
# rmsd_cutoff (the cut height) are set by the user beforehand
# Square form of distance matrix
sq=spd.squareform(dm)
# Perform clustering, capture linkage matrix
clusters=sch.linkage(dm,method=linkage)
# Cluster assignments; cut_tree returns an (n, 1) array, so flatten it
assignments=sch.cut_tree(clusters,height=rmsd_cutoff).ravel()
names=np.asarray(names)
### Extract models closest to cluster centroids (the medoids)
medoids={}
for label in np.unique(assignments):
    # Boolean mask for extracting the submatrix of this cluster
    mask=assignments==label
    # Take the index of the column with the smallest sum of distances from the submatrix
    idx=np.argmin(sq[np.ix_(mask,mask)].sum(axis=0))
    # Element closest to centroid, looked up among this cluster's names
    medoids[label]=names[mask][idx]
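
The medoid selection described in this answer can also be sketched as a small self-contained function. This is only an illustration under stated assumptions: the 1-D toy points stand in for an expensive pairwise distance such as RMSD, the "average" linkage method and the cut height of 5.0 are arbitrary choices, and cluster_medoids is a hypothetical helper name.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, cut_tree


def cluster_medoids(square_dm, labels):
    """Map each cluster label to the index of its medoid (hypothetical helper)."""
    medoids = {}
    for label in np.unique(labels):
        members = np.flatnonzero(labels == label)      # indices of this cluster
        sub = square_dm[np.ix_(members, members)]      # within-cluster distances
        # Medoid: the member with the smallest summed distance to the others
        medoids[int(label)] = int(members[np.argmin(sub.sum(axis=0))])
    return medoids


# Toy stand-in for a precomputed distance matrix (values made up)
points = np.array([[0.0], [1.0], [2.5], [10.0], [11.0], [13.0]])
names = ["a", "b", "c", "d", "e", "f"]
dm = pdist(points)                          # condensed distance matrix
Z = linkage(dm, method="average")
labels = cut_tree(Z, height=5.0).ravel()    # cut_tree returns shape (N, 1)
medoid_idx = cluster_medoids(squareform(dm), labels)
medoid_names = sorted(names[i] for i in medoid_idx.values())
```

Only the distance matrix is needed here, so the approach works even when the objects have no meaningful coordinates from which to compute a mean.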
