How can the accuracy of a clustering algorithm be measured?

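I have a set of points that I clustered with a clustering algorithm (k-means in this case). I also know the ground-truth labels, and I want to measure how accurate my clustering is. What I need is the actual accuracy. The problem, of course, is that the labels given by the clustering do not match the ordering of the original ones.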

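Is there a way to measure this accuracy? The intuitive idea is to compute the confusion-matrix score for every combination of labels and keep only the maximum. Is there a function that does this?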
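The "try every label combination and keep the best" idea can be sketched without enumerating permutations by treating it as an assignment problem on the confusion matrix. The snippet below is a minimal sketch, not a built-in scikit-learn metric: `best_match_accuracy` is a hypothetical helper name, the toy labels are made up, and it assumes that both the true labels and the cluster ids are integers drawn from the same range 0..k-1.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix


def best_match_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one mapping of cluster ids to classes.

    Hypothetical helper; assumes integer labels sharing the range 0..k-1.
    """
    cm = confusion_matrix(y_true, y_pred)  # rows: true classes, columns: cluster ids
    # linear_sum_assignment minimizes total cost, so negate the counts
    # to pick the pairing that maximizes the number of matched points.
    row_ind, col_ind = linear_sum_assignment(-cm)
    return cm[row_ind, col_ind].sum() / cm.sum()


# Toy data (made up): same grouping as the truth with renamed ids and one mistake.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 1, 2, 2, 0, 0, 0, 0])

print(best_match_accuracy(y_true, y_pred))
```

On this toy input the best pairing matches 8 of the 9 points. The one-to-one constraint means that two clusters can never be credited to the same class.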

I also evaluated my results with the Rand score and the adjusted Rand score. How close are these two measures to the actual accuracy?
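One way to see how these scores relate to plain accuracy is to feed them a clustering that is perfect except for swapped ids. This is only an illustrative sketch on made-up labels; it assumes a scikit-learn version that ships `rand_score` (0.24 or later).

```python
import numpy as np
from sklearn.metrics import accuracy_score, adjusted_rand_score, rand_score

y_true = np.array([0, 0, 0, 1, 1, 1])
y_perm = np.array([1, 1, 1, 0, 0, 0])  # identical partition, swapped ids

acc = accuracy_score(y_true, y_perm)       # compares ids position by position
ri = rand_score(y_true, y_perm)            # compares pairs of points, ignores ids
ari = adjusted_rand_score(y_true, y_perm)  # Rand index corrected for chance

print(acc, ri, ari)  # prints 0.0 1.0 1.0
```

The Rand-based scores measure agreement between pairs of points rather than the fraction of correctly labelled points, so they are not on the same scale as accuracy and should not be read as an approximation of it.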

Thanks!

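First, what do you mean by "The problem, of course, is that the labels given by the clustering do not match the ordering of the original one."?

If you know the ground-truth labels, you can rearrange them to match the order of the X matrix, so that the KMeans labels will line up with the true labels after prediction.

In that case, I suggest the following.

If you have ground-truth labels and want to see how accurate the model is, you need a metric such as the Rand index or the mutual information between the predicted and the true labels. You can compute it within a cross-validation scheme and see how the model behaves, i.e. whether it predicts the classes/labels correctly under cross-validation. The goodness of the predictions can then be assessed with a metric such as the Rand index.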
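For the mutual-information option mentioned above, scikit-learn provides `normalized_mutual_info_score` and `adjusted_mutual_info_score`. The following is a small sketch on made-up label arrays (not the asker's data) showing that both ignore how the cluster ids are named.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score, normalized_mutual_info_score

y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_same = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1])   # same partition, renamed ids
y_noisy = np.array([2, 2, 0, 0, 0, 0, 1, 1, 2])  # two points placed in other clusters

for name, y_pred in [("same partition", y_same), ("noisy partition", y_noisy)]:
    nmi = normalized_mutual_info_score(y_true, y_pred)
    ami = adjusted_mutual_info_score(y_true, y_pred)  # corrected for chance agreement
    print(f"{name}: NMI={nmi:.3f}, AMI={ami:.3f}")
```

Both scores reach 1 for the renamed but otherwise identical partition and drop below 1 for the noisy one.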

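In summary:

Define a KMeans model, use cross-validation, and in each iteration compute the Rand index (or the mutual information) between the assignments and the true labels. Repeat this for all iterations and finally take the mean of the Rand index scores. If this score is high, the model is good.

Full example: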

from sklearn.cluster import KMeans
from sklearn.metrics.cluster import adjusted_rand_score
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
import numpy as np

# some data
data = load_iris()
X = data.data
y = data.target  # ground truth labels

# Leave-one-out would score a single test point per fold, and the adjusted
# Rand index of a single point is trivially 1.0, so stratified 5-fold
# cross-validation is used here instead.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
rand_index_scores = []

for train_index, test_index in cv.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_test = y[test_index]  # true labels are only needed for scoring

    # the model
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
    kmeans.fit(X_train)  # fit using training data
    predicted_labels = kmeans.predict(X_test)  # predict using test data

    # calculate goodness of predicted labels
    rand_index_scores.append(adjusted_rand_score(y_test, predicted_labels))

print(np.mean(rand_index_scores))

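Since clustering is an unsupervised learning problem, it has its own specific metrics: https://scikit-learn.org/stable/modules/classes.html#clustering-metrics

You can refer to the discussion in the scikit-learn user guide to understand the differences between the clustering metrics: https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation

For example, the adjusted Rand index compares pairs of points and checks whether two points that share a label in the ground truth also share a label in the prediction. Unlike accuracy, it does not require the labels themselves to be strictly equal.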
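The pair-counting idea can be written out by hand. This sketch computes the plain (unadjusted) Rand index in pure Python on made-up labels; `pairwise_rand_index` is a hypothetical helper name, and the chance correction that turns it into the adjusted Rand index is left out.

```python
from itertools import combinations


def pairwise_rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which the two labelings agree.

    A pair counts as an agreement when both labelings put the two points
    together, or both put them apart. Hypothetical helper for illustration.
    """
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agree += int(same_true == same_pred)
        total += 1
    return agree / total


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 0, 2]  # ids renamed, and one point merged into another cluster

print(pairwise_rand_index(y_true, y_pred))  # 12 of 15 pairs agree -> 0.8
```

Only the "same or different" relation of each pair enters the score, which is why renaming the cluster ids does not change it.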

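You can use sklearn.metrics.accuracy_score, as described in the link below:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html

An example can be seen in the question linked below:

SKlearn: computing the accuracy score of K-means on a test dataset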
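Note that `accuracy_score` compares label values directly, so cluster ids have to be translated into class labels before it means anything. Here is a rough sketch of one way to do that, assuming a simple majority-vote mapping (each cluster gets the most frequent true class among its members); the mapping rule is an assumption made for illustration, not something the linked pages prescribe.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Relabel each cluster with the most frequent true class among its members.
mapped = np.empty_like(y)
for c in np.unique(cluster_ids):
    members = cluster_ids == c
    mapped[members] = np.bincount(y[members]).argmax()

acc = accuracy_score(y, mapped)
print(acc)
```

A majority vote can map several clusters to the same class, so this number behaves like cluster purity; requiring a one-to-one assignment between clusters and classes would be a stricter alternative.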
