How to improve the time complexity of Affinity Propagation clustering



I am trying to cluster similar patterns in a list using the Affinity Propagation clustering method. self_pat is a list of 80K patterns that need to be clustered. I use the following code:

import numpy as np
from sklearn.cluster import AffinityPropagation

self_pat = np.asarray(self_pat)  # so that indexing with a list will work
# Build the full 80K x 80K similarity matrix from negated pairwise distances
lev_similarity = -1 * np.array([[calculate_levenshtein_distance(w1, w2) for w1 in self_pat] for w2 in self_pat])
affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)

for cluster_id in np.unique(affprop.labels_):
    exemplar = self_pat[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(self_pat[np.nonzero(affprop.labels_ == cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))

The calculate_levenshtein_distance function is as follows:

from difflib import ndiff

def calculate_levenshtein_distance(str_1, str_2):
    """
    The Levenshtein distance is a string metric for measuring the difference between two sequences.
    It is calculated as the minimum number of single-character edits necessary to transform one string into another.
    """
    distance = 0
    buffer_removed = buffer_added = 0
    for x in ndiff(str_1, str_2):
        code = x[0]
        # Code '?' is ignored as it does not translate to any modification
        if code == ' ':
            distance += max(buffer_removed, buffer_added)
            buffer_removed = buffer_added = 0
        elif code == '-':
            buffer_removed += 1
        elif code == '+':
            buffer_added += 1
    distance += max(buffer_removed, buffer_added)
    return distance
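
For example, the classic kitten/sitting pair gives the expected edit distance:

print(calculate_levenshtein_distance("kitten", "sitting"))  # 3: substitute k->s, substitute e->i, insert g

Note that ndiff itself walks over a character-level diff of the two strings, so this function contributes the third loop on top of the two list-comprehension loops that build the matrix.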

The program above runs three nested loops, so the clustering takes a very long time. Is there any way to reduce the program's complexity?

For smaller datasets the completion time is usually fine; for very large datasets the time needed to finish the job is essentially intolerable. As you have discovered, this kind of clustering does not scale well. Perhaps you can draw a random sample from your full dataset and cluster only that.
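
To put the scale in perspective: the precomputed matrix for 80K patterns holds 80,000 × 80,000 float64 entries, roughly 51 GB on its own, and AffinityPropagation internally allocates dense matrices of the same shape for its responsibility and availability updates.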

# Fraction of rows, assuming df is a pandas DataFrame holding your patterns
# here you get 25% of the rows
df.sample(frac = 0.25)
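
Since self_pat is a plain list rather than a DataFrame, here is a minimal sketch of the same idea using random.sample; the sample size of 5,000 is an arbitrary assumption, so tune it to your memory and time budget:

import random

import numpy as np
from sklearn.cluster import AffinityPropagation

sample_size = 5000  # assumed budget, not a value from the original post
pat_sample = np.asarray(random.sample(list(self_pat), sample_size))

# 5000^2 = 25M distance computations instead of 80000^2 = 6.4B
lev_similarity = -1 * np.array(
    [[calculate_levenshtein_distance(w1, w2) for w1 in pat_sample]
     for w2 in pat_sample]
)

affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)

After clustering the sample, each remaining pattern can be assigned to its nearest exemplar by Levenshtein distance, which costs O(n·k) for k exemplars rather than the O(n²) of the full pairwise matrix.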
