How to cluster *features* based on their correlations to each other with sklearn k-means clustering

I have a pandas DataFrame with rows as records (patients) and 105 columns as features (attributes of each patient).

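I want to cluster not the patients (the rows, as is customary) but the columns, so that I can see which features are similar or correlated to which other features. I can already compute the correlation of each feature with every other feature using df.corr(). But how can I cluster these into k=2,3,4... groups using sklearn.cluster.KMeans?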

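I tried KMeans(n_clusters=2).fit(df.T), which does cluster the features (because I took the transpose of the matrix), but only with a Euclidean distance function, not according to their correlations. I would prefer to cluster the features according to their correlations.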

This should be very easy, but I would appreciate your help.

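KMeans is not very useful in this case, but you can use any clustering method that can work with a distance matrix. For example, agglomerative clustering.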
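As a rough sketch of what "working with a distance matrix" can look like in scikit-learn, the snippet below passes 1 - |correlation| to AgglomerativeClustering as a precomputed distance. The dataframe `df_demo` and its column names are synthetic and purely illustrative. The `metric` keyword is assumed to be available (newer scikit-learn releases); older releases call the same option `affinity`, so the sketch falls back to that.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in for the real dataframe: two groups of correlated columns.
rng = np.random.default_rng(0)
base_a = rng.normal(size=200)
base_b = rng.normal(size=200)
df_demo = pd.DataFrame({
    "a1": base_a + 0.1 * rng.normal(size=200),
    "a2": -base_a + 0.1 * rng.normal(size=200),  # strongly negatively correlated with a1
    "b1": base_b + 0.1 * rng.normal(size=200),
    "b2": base_b + 0.1 * rng.normal(size=200),
})

# Distance between features: 0 when |r| = 1, 1 when r = 0.
dist = 1 - df_demo.corr().abs().values

# Ward linkage is not accepted with a precomputed matrix, so use average linkage.
try:
    model = AgglomerativeClustering(n_clusters=2, metric="precomputed", linkage="average")
except TypeError:
    # Older scikit-learn versions name this parameter "affinity".
    model = AgglomerativeClustering(n_clusters=2, affinity="precomputed", linkage="average")

sk_labels = model.fit_predict(dist)
print(dict(zip(df_demo.columns, sk_labels)))
```

Because the distance uses the absolute correlation, `a1` and `a2` end up in the same cluster even though they move in opposite directions.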

I will use scipy; the sklearn version is simpler, but not as powerful (for example, in sklearn you cannot use the Ward method with a distance matrix).

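import numpy as np
import scipy.spatial.distance as ssd
from scipy.cluster import hierarchy

df = ...  # your dataframe with many features
corr = df.corr()  # feature-by-feature correlations; we can treat this as an affinity matrix
distances = 1 - corr.abs().values  # pairwise distances: 0 when |r| = 1, 1 when r = 0
np.fill_diagonal(distances, 0)  # remove floating-point noise on the diagonal
distArray = ssd.squareform(distances, checks=False)  # condensed 1-D form expected by linkage
# "ward" is only strictly defined for Euclidean distances; "average" or "complete" also work here
hier = hierarchy.linkage(distArray, method="ward")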

Read the documentation to understand the structure of hier.
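As a small illustration of that structure, the sketch below builds a linkage matrix from a made-up 4x4 distance matrix (the values and the names `toy_dist` and `toy_hier` are hypothetical) and prints each merge. For n features the result has n - 1 rows and 4 columns.

```python
import numpy as np
import scipy.spatial.distance as ssd
from scipy.cluster import hierarchy

# Hypothetical distance matrix for four features (symmetric, zero diagonal).
toy_dist = np.array([
    [0.00, 0.10, 0.90, 0.80],
    [0.10, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.20],
    [0.80, 0.90, 0.20, 0.00],
])
toy_hier = hierarchy.linkage(ssd.squareform(toy_dist), method="average")

# One row per merge: [cluster index A, cluster index B, merge distance, size of new cluster].
# Indices 0..n-1 are the original features; index n + i is the cluster created by row i.
for left, right, height, size in toy_hier:
    print(int(left), int(right), round(float(height), 4), int(size))
```

The third column is the height at which two clusters are joined, which is the quantity a distance threshold is later compared against.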

You can plot the dendrogram with

dend = hierarchy.dendrogram(hier, truncate_mode="level", p=30, color_threshold=1.5)

Finally, get the cluster labels for the features

threshold = 1.5  # choose threshold using dendrogram or any other method (e.g. quantile or desired number of features)
cluster_labels = hierarchy.fcluster(hier, threshold, criterion="distance")
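Since the question asks for a fixed number of clusters (k=2,3,4...), one option is to let fcluster cut the tree by cluster count instead of by distance, using criterion="maxclust". The following is a self-contained sketch on synthetic data; `toy_df`, its column names, and the choice of average linkage are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
import scipy.spatial.distance as ssd
from scipy.cluster import hierarchy

# Synthetic data: 3 hidden factors, each driving 4 noisy features.
rng = np.random.default_rng(42)
n_rows = 300
latent = rng.normal(size=(n_rows, 3))
columns = {}
for g in range(3):
    for j in range(4):
        columns[f"g{g}_f{j}"] = latent[:, g] + 0.2 * rng.normal(size=n_rows)
toy_df = pd.DataFrame(columns)

# Same distance construction as above: 1 - |correlation|.
toy_distances = 1 - toy_df.corr().abs().values
np.fill_diagonal(toy_distances, 0)
toy_linkage = hierarchy.linkage(
    ssd.squareform(toy_distances, checks=False), method="average"
)

# Ask for at most k flat clusters instead of choosing a distance threshold.
k = 3
toy_labels = hierarchy.fcluster(toy_linkage, t=k, criterion="maxclust")

# Group feature names by their cluster label.
feature_groups = {}
for name, label in zip(toy_df.columns, toy_labels):
    feature_groups.setdefault(int(label), []).append(name)
print(feature_groups)
```

Labels returned by fcluster start at 1, and their order follows the order of the columns in the dataframe.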

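> Create a new matrix by taking the correlations of all features with df.corr(), then use this new matrix as the dataset for the k-means algorithm. This will give you clusters of features that have similar correlations.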
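A minimal sketch of this suggestion, on synthetic data (the dataframe `km_df` and its column names are hypothetical): each row of the correlation matrix is treated as one feature's "correlation profile", and KMeans groups features whose profiles are close in Euclidean distance.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic data: two hidden factors, each driving three noisy features.
rng = np.random.default_rng(1)
n_rows = 300
f1 = rng.normal(size=n_rows)
f2 = rng.normal(size=n_rows)
km_df = pd.DataFrame({
    "x1": f1 + 0.2 * rng.normal(size=n_rows),
    "x2": f1 + 0.2 * rng.normal(size=n_rows),
    "x3": f1 + 0.2 * rng.normal(size=n_rows),
    "y1": f2 + 0.2 * rng.normal(size=n_rows),
    "y2": f2 + 0.2 * rng.normal(size=n_rows),
    "y3": f2 + 0.2 * rng.normal(size=n_rows),
})

# Rows of the correlation matrix become the samples that KMeans clusters.
km_corr = km_df.corr()
km = KMeans(n_clusters=2, n_init=10, random_state=0)
km_labels = km.fit_predict(km_corr.values)

km_result = pd.Series(km_labels, index=km_corr.index)
print(km_result)
```

Note that this uses signed correlations, so two strongly negatively correlated features have very different rows and would tend to land in different clusters; passing km_df.corr().abs() instead is one way to treat them as related.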
