I'm using the Gaussian mixture model (GMM) from sklearn.mixture to cluster my dataset. I can use the score() function to compute the log probability under the model. However, I'm looking for a metric called "purity", which is defined in this article. How can I implement it in Python? My current implementation looks like this:
import numpy as np
from sklearn.mixture import GMM
# X is a 1000 x 2 array (1000 samples of 2 coordinates).
# It is actually a 2 dimensional PCA projection of data
# extracted from the MNIST dataset, but this random array
# is equivalent as far as the code is concerned.
X = np.random.rand(1000, 2)
clusterer = GMM(n_components=3, covariance_type='diag')
clusterer.fit(X)
cluster_labels = clusterer.predict(X)
# Now I can count the labels for each cluster..
count0 = list(cluster_labels).count(0)
count1 = list(cluster_labels).count(1)
count2 = list(cluster_labels).count(2)
However, I can't loop over each cluster to compute the confusion matrix (as per this question).
David's answer works, but here is another way to do it.
import numpy as np
from sklearn import metrics
def purity_score(y_true, y_pred):
    # compute contingency matrix (also called confusion matrix)
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    # return purity
    return np.sum(np.amax(contingency_matrix, axis=0)) / np.sum(contingency_matrix)
Also, if you need to compute inverse purity, all you have to do is replace "axis=0" with "axis=1".
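As a quick sanity check, here is a usage sketch with made-up labels (the y_true/y_pred values below are purely illustrative, not taken from the question's data); over-clustering gives perfect purity but a lower inverse purity:

import numpy as np
from sklearn import metrics

# two true classes split across three predicted clusters (illustrative values)
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 1, 1, 2, 2])

print(purity_score(y_true, y_pred))  # 1.0: every cluster is pure

# inverse purity: the same computation with axis=1
contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
print(np.sum(np.amax(contingency_matrix, axis=1)) / np.sum(contingency_matrix))  # 4/6 ~= 0.67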
sklearn does not implement a cluster purity metric. You have two options:

- Implement the measurement yourself using sklearn data structures. This and this have some Python source for measuring purity, but either your data or the functions will need to be adapted so they are compatible with each other (a minimal sketch of this option follows below).
- Use the (less mature) PML library, which does implement cluster purity.
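If you take the do-it-yourself route, one possible numpy-only sketch (assuming the ground-truth labels are non-negative integers) is:

import numpy as np

def purity(y_true, y_pred):
    """Fraction of samples that carry the majority true label of their cluster."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    correct = 0
    for cluster in np.unique(y_pred):
        # true labels of the samples assigned to this cluster
        labels_in_cluster = y_true[y_pred == cluster]
        # the majority label contributes its count to the numerator
        correct += np.bincount(labels_in_cluster).max()
    return correct / len(y_true)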
A very late contribution.

You can try to implement it like this, pretty much like in this gist:
import numpy as np
from sklearn.metrics import accuracy_score

def purity_score(y_true, y_pred):
    """Purity score
    Args:
        y_true(np.ndarray): n*1 matrix Ground truth labels
        y_pred(np.ndarray): n*1 matrix Predicted clusters

    Returns:
        float: Purity score
    """
    # matrix which will hold the majority-voted labels
    y_voted_labels = np.zeros(y_true.shape)
    # Ordering labels
    ## Labels might be missing, e.g. with a set like 0,2 where 1 is missing
    ## First find the unique labels, then map the labels to an ordered set
    ## 0,2 should become 0,1
    labels = np.unique(y_true)
    ordered_labels = np.arange(labels.shape[0])
    for k in range(labels.shape[0]):
        y_true[y_true == labels[k]] = ordered_labels[k]
    # Update unique labels
    labels = np.unique(y_true)
    # Use the ordered labels plus one extra edge as histogram bin edges, so that
    # each class gets its own bin and occurrences are counted per class,
    # the upper edge being excluded: [bin_i, bin_i+1)
    bins = np.concatenate((labels, [np.max(labels) + 1]), axis=0)

    for cluster in np.unique(y_pred):
        hist, _ = np.histogram(y_true[y_pred == cluster], bins=bins)
        # Find the most present label in the cluster
        winner = np.argmax(hist)
        y_voted_labels[y_pred == cluster] = winner

    return accuracy_score(y_true, y_voted_labels)
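A quick usage sketch with illustrative labels (note that the remapping step mutates y_true in place, so pass a copy if you still need the original array):

import numpy as np

# illustrative labels with a "gap" (class 1 absent) to exercise the remapping step
y_true = np.array([0, 0, 0, 2, 2, 2])
y_pred = np.array([1, 1, 0, 0, 0, 0])

print(purity_score(y_true, y_pred))  # 5/6 ~= 0.83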
The currently top-voted answer implements the purity metric correctly, but it may not be the most appropriate metric in all cases, because it does not ensure that each predicted cluster label is assigned to a true label only once.

For example, consider a very imbalanced dataset with 99 examples of one label and 1 example of another. Any clustering (e.g. one with two equal clusters of size 50) will then achieve a purity of at least 0.99, making it a useless metric.

Instead, in cases where the number of clusters is the same as the number of labels, cluster accuracy may be more appropriate. It has the advantage of mirroring classification accuracy in an unsupervised setting. To compute cluster accuracy, we need the Hungarian algorithm to find the optimal matching between cluster labels and true labels. The SciPy function linear_sum_assignment does this:
import numpy as np
from sklearn import metrics
from scipy.optimize import linear_sum_assignment
def cluster_accuracy(y_true, y_pred):
    # compute contingency matrix (also called confusion matrix)
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    # Find optimal one-to-one mapping between cluster labels and true labels
    row_ind, col_ind = linear_sum_assignment(-contingency_matrix)
    # Return cluster accuracy
    return contingency_matrix[row_ind, col_ind].sum() / np.sum(contingency_matrix)
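To make the difference concrete on the imbalanced example described above (99 samples of one label, 1 of another, clustered into two groups of 50), a short comparison sketch:

import numpy as np
from sklearn import metrics

# uses cluster_accuracy() defined above
y_true = np.array([0] * 99 + [1])
y_pred = np.array([0] * 50 + [1] * 50)

contingency = metrics.cluster.contingency_matrix(y_true, y_pred)
purity = np.sum(np.amax(contingency, axis=0)) / np.sum(contingency)

print(purity)                            # 0.99 even though the clustering is uninformative
print(cluster_accuracy(y_true, y_pred))  # 0.51, which reflects the poor clustering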