How to tune/choose the preference parameter of Affinity Propagation?



I have a large dictionary of "pairwise similarity matrices" that looks like the following:

similarities['group1']

array([[1.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 1.        , 0.09      , 0.09      , 0.        ],
       [0.        , 0.09      , 1.        , 0.94535157, 0.        ],
       [0.        , 0.09      , 0.94535157, 1.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 1.        ]])

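In short, each element of the matrix above is the probability that record_i and record_j are similar (values range from 0 to 1 inclusive), where 1 means exactly the same and 0 means completely different.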

I then feed each similarity matrix into the AffinityPropagation algorithm in order to group/cluster similar records:

from sklearn.cluster import AffinityPropagation

sim = similarities['group1']
clusterer = AffinityPropagation(affinity='precomputed',
                                damping=0.5,
                                max_iter=25000,
                                convergence_iter=2500,
                                preference=????)  # ISSUE here
affinity = clusterer.fit(sim)
cluster_centers_indices = affinity.cluster_centers_indices_
labels = affinity.labels_

However, since I run the above on many similarity matrices, I need a generalized preference parameter, and I can't seem to tune one.

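The docs say it is set to the median of the similarity matrix by default, but I get lots of false positives with that setting. The mean sometimes works and sometimes gives too many clusters, etc.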
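To see what those two candidate values look like on a sparse matrix, here is a minimal sketch that computes the median and mean of the `group1` example above. It uses only the standard library so it runs without dependencies; `flat`, `matrix_median`, and `matrix_mean` are names introduced here for illustration.

```python
from statistics import mean, median

# Example matrix copied from the 'group1' entry shown earlier.
sim = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.09, 0.09, 0.0],
    [0.0, 0.09, 1.0, 0.94535157, 0.0],
    [0.0, 0.09, 0.94535157, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]

# Flatten every entry, diagonal included, so the statistics cover the whole matrix.
flat = [value for row in sim for value in row]

matrix_median = median(flat)
matrix_mean = mean(flat)
print(matrix_median, round(matrix_mean, 4))
```

Here 14 of the 25 entries are zero, so the median is 0 while the mean is about 0.29. The two candidates can therefore sit far apart on the same matrix.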


For example, when playing with the preference parameter, these are the results I get from the similarity matrix:

  • preference = default # which is the median (value 0.2) of the similarity matrix: (incorrect results; record 18 should not be in this cluster because its similarity to the other records is very low):

    # Indexes of the elements in Cluster n°5: [15, 18, 22, 27]
    {'15_18': 0.08,
    '15_22': 0.964546229533378,
    '15_27': 0.6909703138051403,
    '18_22': 0.12,    # Not Ok, the similarity is too low
    '18_27': 0.19,    # Not Ok, the similarity is too low
    '22_27': 0.6909703138051403}
    
  • preference = 0.2, in fact anywhere from 0.11 to 0.26: (correct results, since the records are similar):

    # Indexes of the elements in Cluster n°5: [15, 22, 27]
    {'15_22': 0.964546229533378,
    '15_27': 0.6909703138051403,
    '22_27': 0.6909703138051403}
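The judgment "record 18 does not belong" can be made mechanical by listing the pairs inside a cluster whose similarity falls at or below a cutoff. The sketch below does that for the two results above; the helper name `weak_pairs` and the 0.5 cutoff are assumptions chosen for illustration, not values prescribed by the library.

```python
def weak_pairs(pair_scores, threshold=0.5):
    """Return the pairs whose similarity does not exceed the threshold."""
    return {pair: score for pair, score in pair_scores.items() if score <= threshold}


# Pairwise similarities of the cluster obtained with the default preference.
cluster_with_18 = {
    '15_18': 0.08,
    '15_22': 0.964546229533378,
    '15_27': 0.6909703138051403,
    '18_22': 0.12,
    '18_27': 0.19,
    '22_27': 0.6909703138051403,
}

# Pairwise similarities of the cluster obtained with the adjusted preference.
cluster_without_18 = {
    '15_22': 0.964546229533378,
    '15_27': 0.6909703138051403,
    '22_27': 0.6909703138051403,
}

print(sorted(weak_pairs(cluster_with_18)))
print(weak_pairs(cluster_without_18))
```

Every weak pair in the first cluster involves record 18, and the second cluster has none.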
    

My question is: how should I choose this preference parameter in a way that generalizes?

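A naive, brute-force grid search solution can be implemented: if a connection within a cluster scores below some threshold (e.g. 0.5), we rerun the clustering with an adjusted value of the preference parameter.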

A naive implementation would look like the following.


First, a function that tests whether a cluster needs tuning, using a threshold of 0.5 in this example:

def is_tuning_required(similarity_matrix, rows_of_cluster):
    rows = similarity_matrix[rows_of_cluster]
    for row in rows:
        for col_index in rows_of_cluster:
            score = row[col_index]
            if score > 0.5:
                continue
            return True
    return False
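The same check can also be written without explicit loops by slicing out the cluster's square sub-matrix. This is an assumed equivalent rather than part of the original answer; `has_weak_link` and its `threshold` parameter are names introduced here for illustration.

```python
import numpy as np


def has_weak_link(similarity_matrix, rows_of_cluster, threshold=0.5):
    """Return True if any pair inside the cluster scores at or below the threshold."""
    idx = np.asarray(rows_of_cluster, dtype=int)
    # np.ix_ selects the rows and columns of the cluster members at once.
    block = similarity_matrix[np.ix_(idx, idx)]
    return bool((block <= threshold).any())


# Example matrix copied from the 'group1' entry of the question.
sim = np.array([
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.09, 0.09, 0.0],
    [0.0, 0.09, 1.0, 0.94535157, 0.0],
    [0.0, 0.09, 0.94535157, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
])

print(has_weak_link(sim, [2, 3]))     # records 2 and 3 are strongly similar
print(has_weak_link(sim, [1, 2, 3]))  # record 1 is only weakly linked (0.09)
```

On this matrix, the cluster `[2, 3]` passes the check, while adding record 1 makes it fail.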

Next, build the range of preference values that the clustering will be run against:

import numpy as np

def get_pref_range(similarity):
    starting_point = np.median(similarity)
    if starting_point == 0:
        starting_point = np.mean(similarity)
    # Let's try to accelerate the pace of values picking:
    # small starting points are doubled, larger ones grow more slowly.
    # Assumption: a factor of 1 would never advance the loops below,
    # so 1.25 is used for the slower pace.
    step = 1.25 if starting_point >= 0.05 else 2
    preference_tuning_range = [starting_point]
    max_val = starting_point
    while max_val < 1:
        max_val *= 1.25 if max_val > 0.1 and step == 2 else step
        preference_tuning_range.append(max_val)
    min_val = starting_point
    if starting_point >= 0.05:
        while min_val > 0.01:
            min_val /= step
            preference_tuning_range.append(min_val)
    return preference_tuning_range

A plain AffinityPropagation run, with the preference parameter passed in:

from sklearn.cluster import AffinityPropagation

def run_clustering(similarity, preference):
    clusterer = AffinityPropagation(damping=0.9,
                                    affinity='precomputed',
                                    max_iter=5000,
                                    convergence_iter=2500,
                                    verbose=False,
                                    preference=preference)
    affinity = clusterer.fit(similarity)
    labels = affinity.labels_
    return labels, len(set(labels)), affinity.cluster_centers_indices_

The method we would actually call, with a similarity (1 - distance) matrix as its argument:

def run_ideal_clustering(similarity):
    preference_tuning_range = get_pref_range(similarity)
    best_tested_preference = None
    for preference in preference_tuning_range:
        labels, labels_count, cluster_centers_indices = run_clustering(similarity, preference)
        needs_tuning = False
        wrong_clusters = 0
        for label_index in range(labels_count):
            cluster_elements_indexes = np.where(labels == label_index)[0]
            tuning_required = is_tuning_required(similarity, cluster_elements_indexes)
            if tuning_required:
                wrong_clusters += 1
                if not needs_tuning:
                    needs_tuning = True
        # Remember the preference that produced the fewest wrong clusters so far
        if best_tested_preference is None or wrong_clusters < best_tested_preference[1]:
            best_tested_preference = (preference, wrong_clusters)
        if not needs_tuning:
            return labels, labels_count, cluster_centers_indices
    # No tested preference gave a fully valid clustering: rerun with the one that had the fewest wrong clusters
    return run_clustering(similarity, best_tested_preference[0])

Obviously, this is a brute-force solution that will not perform well on large datasets / similarity matrices.

If a simpler and better solution gets posted, I'll accept it.
