sklearn agglomerative clustering: dynamically updating the number of clusters

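The documentation for sklearn.cluster.AgglomerativeClustering mentions that,

when varying the number of clusters and using caching, it may be advantageous to compute the full tree.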

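This seems to imply that one can compute the full tree once and then quickly change the desired number of clusters as needed, without recomputing the tree (thanks to caching).

However, the procedure for changing the number of clusters does not seem to be documented. I would like to do this but am unsure how to proceed.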

Update: to clarify, the fit method does not take the number of clusters as an input: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering.fit

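Set a caching directory with the parameter memory='mycachedir'. If you then set compute_full_tree=True, rerunning fit with different values of n_clusters will use the cached tree rather than recomputing it each time. Here is an example of how to do this with sklearn's grid search API: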

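from sklearn.cluster import AgglomerativeClustering
# GridSearchCV lives in sklearn.model_selection; sklearn.grid_search has been removed
from sklearn.model_selection import GridSearchCV

ac = AgglomerativeClustering(memory='mycachedir',
                             compute_full_tree=True)
classifier = GridSearchCV(ac,
                          {'n_clusters': range(2, 6)},  # parameter names must be strings
                          scoring='adjusted_rand_score',
                          n_jobs=-1, verbose=2)
# X: feature matrix, y: ground-truth labels used by the scorer
classifier.fit(X, y)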
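
If no ground-truth labels are available, or the grid search gets in the way (as far as I know, AgglomerativeClustering has no predict method, which the built-in scorers typically rely on), a plain loop over n_clusters is a simpler way to reuse the cached tree. The following is only a sketch under assumptions: the data comes from make_blobs, the cache directory is a temporary one, and the name labels_by_k is made up for illustration.

```python
import shutil
import tempfile

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Hypothetical toy data: 60 points drawn around 3 centers
X, _ = make_blobs(n_samples=60, centers=3, random_state=0)

# Temporary directory standing in for 'mycachedir'
cache_dir = tempfile.mkdtemp()

ac = AgglomerativeClustering(memory=cache_dir, compute_full_tree=True)

labels_by_k = {}
for k in range(2, 6):
    # n_clusters is an estimator parameter, not an argument of fit
    ac.set_params(n_clusters=k)
    # With compute_full_tree=True the tree-building step does not depend on
    # n_clusters, so refits after the first one can read it from the cache
    labels_by_k[k] = ac.fit_predict(X)

for k, labels in labels_by_k.items():
    print(k, np.bincount(labels))

# Remove the temporary cache when done
shutil.rmtree(cache_dir, ignore_errors=True)
```

Only the first fit should pay for building the tree; the later ones just cut it at a different level.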

I know this is an old question, but the solution below might be helpful.

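# scores: input matrix of shape (n_samples, n_features)
from scipy.cluster.hierarchy import cut_tree, linkage
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import euclidean_distances

# Build the full Ward hierarchy once; every cut below reuses it
linkage_mat = linkage(scores, method="ward")
# Pairwise distances, computed once for the silhouette score
euc_scores = euclidean_distances(scores)
n_l = 2                # smallest number of clusters to try
n_h = scores.shape[0]  # exclusive upper bound, so at most n_samples - 1 clusters
silh_score = -2        # sentinel below the minimum silhouette score of -1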
# Selecting the best number of clusters based on the silhouette score
for i in range(n_l, n_h):
    local_labels = list(cut_tree(linkage_mat, n_clusters=i).flatten())
    sc = silhouette_score(
        euc_scores,
        metric="precomputed",
        labels=local_labels,
        random_state=42)
    if silh_score < sc:
        silh_score = sc
        labels = local_labels
n_clusters = len(set(labels))
print(f"Optimal number of clusters: {n_clusters}")
print(f"Best silhouette score: {silh_score}")
# ...
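
As a side note, cut_tree also accepts a list of cluster counts, so several cuts of the same linkage matrix can be obtained in one call. A small sketch, assuming made-up data and the illustrative names candidate_ks and labels_by_k:

```python
import numpy as np
from scipy.cluster.hierarchy import cut_tree, linkage

rng = np.random.default_rng(0)
# Hypothetical data: three well-separated groups of 10 points in 2-D
scores = np.vstack(
    [rng.normal(loc=c, scale=0.1, size=(10, 2)) for c in (0.0, 5.0, 10.0)]
)

# The hierarchy is computed once
linkage_mat = linkage(scores, method="ward")

candidate_ks = [2, 3, 4]
# One column of labels per requested number of clusters
all_labels = cut_tree(linkage_mat, n_clusters=candidate_ks)
labels_by_k = {k: all_labels[:, j] for j, k in enumerate(candidate_ks)}

for k, labels in labels_by_k.items():
    print(k, np.bincount(labels))
```

Each column can then be scored separately, for example with silhouette_score as in the loop above.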
