Matching the output of scipy linkage() and dendrogram()

I am drawing a dendrogram from scratch using the Z and P outputs of code like the following (see below for a more complete example):

Z = scipy.cluster.hierarchy.linkage(...)
P = scipy.cluster.hierarchy.dendrogram(Z, ..., no_plot=True)

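To do what I want, I need to match a given index in P["icoord"]/P["dcoord"] (which contain the coordinates for drawing the cluster links in the plot) with the corresponding index in Z (which contains the information about which data elements are in which cluster), or vice versa. Unfortunately, the position of a cluster in P["icoord"]/P["dcoord"] does not, in general, match its position in Z (see the output of the code below for proof).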

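Question: how can I match them up? I need either a function Z_i = f(P_coords_i) or its inverse P_coords_i = g(Z_i), so that I can iterate over one list and easily access the corresponding elements of the other.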

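The code below generates 26 random points and labels them with the letters of the alphabet. It then prints the letters corresponding to the clusters represented by the rows of Z, followed by the points in P whose dcoord is zero (i.e. the leaf nodes), to show that the two generally do not line up. For example, the first element of Z corresponds to the cluster iu, but the first set of points in P["icoord"]/P["dcoord"] draws the cluster jy, and the link for iu only appears several elements later.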

    import string

    import numpy as np
    from scipy.cluster import hierarchy
    from scipy.spatial import distance

    # let's make some random data
    np.random.seed(1)
    data = np.random.multivariate_normal([0, 0], [[5, 0], [0, 1]], 26)
    letters = list(string.ascii_lowercase)
    X = distance.pdist(data)

    # here's the code I need to run for my use-case
    Z = hierarchy.linkage(X)
    P = hierarchy.dendrogram(Z, labels=letters, no_plot=True)

    # let's look at the order of Z
    print("Z:")
    clusters = letters.copy()
    for c1, c2, _, _ in Z:
        # each row of Z merges two earlier clusters into a new one
        clusters.append(clusters[int(c1)] + clusters[int(c2)])
        print(clusters[-1])

    # now let's look at the order of P["icoord"] and P["dcoord"]
    print("\nP:")

    def lookup(y, x):
        # a link end at height 0 is a leaf; leaves sit at x = 5, 15, 25, ...
        return "?" if y else P["ivl"][int((x - 5) / 10)]

    for (x1, x2, x3, x4), (y1, y2, y3, y4) in zip(P["icoord"], P["dcoord"]):
        print(lookup(y1, x1) + lookup(y4, x4))

Output:

------Z:
iu
ez
niu
jy
ad
pr
bq
prbq
wniu
gwniu
ezgwniu
hm
ojy
prbqezgwniu
ks
ojyprbqezgwniu
vks
ojyprbqezgwniuvks
lhm
adlhm
fadlhm
cfadlhm
tcfadlhm
ojyprbqezgwniuvkstcfadlhm
xojyprbqezgwniuvkstcfadlhm
------P:
jy
o?
pr
bq
??
ez
iu
n?
w?
g?
??
??
??
ks
v?
??
ad
hm
l?
??
f?
c?
t?
??
x?

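Key idea: imitate the code that constructs R['icoord']/R['dcoord'] (R here is the dict returned by dendrogram(), called P in the question). Append each cluster index to an initially empty list, cluster_id_list, at the same point where the link coordinates are appended. The elements of cluster_id_list and of R['icoord']/R['dcoord'] will then be aligned.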

Consider the following code:

    def append_index(Z, n, i, cluster_id_list):
        # Mirrors the recursion in
        # https://github.com/scipy/scipy/blob/4cf21e753cf937d1c6c2d2a0e372fbc1dbbeea81/scipy/cluster/hierarchy.py#L3549
        # i is the cluster index, counted over all 2 * n - 1 clusters,
        # so i - n is the corresponding row index in Z.
        if i < n:
            return  # a leaf produces no link
        aa = int(Z[i - n, 0])
        ab = int(Z[i - n, 1])
        append_index(Z, n, aa, cluster_id_list)
        append_index(Z, n, ab, cluster_id_list)
        # i - n is appended at the same point where hierarchy.dendrogram
        # appends to 'icoord' and 'dcoord', so the two orders agree.
        cluster_id_list.append(i - n)


    def get_linkid_clusterid_relation(Z):
        n = Z.shape[0] + 1
        i = 2 * n - 2  # index of the root cluster
        cluster_id_list = []
        append_index(Z, n, i, cluster_id_list)
        # cluster_id_list[k] is the row of Z that R['icoord'][k] / R['dcoord'][k] corresponds to
        dict_linkid_2_clusterid = {linkid: clusterid for linkid, clusterid in enumerate(cluster_id_list)}
        dict_clusterid_2_linkid = {clusterid: linkid for linkid, clusterid in enumerate(cluster_id_list)}
        return dict_linkid_2_clusterid, dict_clusterid_2_linkid

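I simply imitate the recursion that the dendrogram function performs through _dendrogram_calculate_info. dict_linkid_2_clusterid gives the cluster each link belongs to: dict_linkid_2_clusterid[i] is the cluster that P["icoord"][i]/P["dcoord"][i] corresponds to, expressed as a row index into Z. dict_clusterid_2_linkid is the inverse mapping.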
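One way to sanity-check this alignment is to compare heights: the top of each link in P["dcoord"] should equal the merge distance in column 2 of the matching row of Z. The sketch below is not part of the answer above. It rebuilds the same post-order with an explicit stack instead of recursion (the helper name postorder_link_ids is made up for illustration) and assumes default dendrogram options, i.e. no count_sort, distance_sort or truncate_mode.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial import distance


def postorder_link_ids(Z):
    """Rows of Z in left-child, right-child, parent order (hypothetical helper)."""
    n = Z.shape[0] + 1
    order = []
    stack = [(2 * n - 2, False)]  # start at the root cluster
    while stack:
        node, expanded = stack.pop()
        if node < n:
            continue  # leaves produce no link
        if expanded:
            order.append(node - n)
        else:
            stack.append((node, True))
            # push the right child first so the left child is handled first
            stack.append((int(Z[node - n, 1]), False))
            stack.append((int(Z[node - n, 0]), False))
    return order


rng = np.random.RandomState(1)
data = rng.multivariate_normal([0, 0], [[5, 0], [0, 1]], 26)
Z = hierarchy.linkage(distance.pdist(data))
P = hierarchy.dendrogram(Z, no_plot=True)

order = postorder_link_ids(Z)
heights_from_P = np.array([d[1] for d in P["dcoord"]])  # top of each link
heights_from_Z = Z[order, 2]                            # merge distances
print(np.allclose(heights_from_P, heights_from_Z))
```

The explicit stack also avoids Python's recursion limit in this helper, which can matter for very deep, chain-like trees.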

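Note: count_sort and distance_sort, if used, affect the order in which links are appended. You can extend this answer by porting the relevant extra code from the scipy source. The truncate_mode parameter could be taken into account in the same way.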
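If you would rather not port the sorting logic, a different technique is to match links by their coordinates instead of by traversal order. The sketch below is an alternative to the approach in this answer, not a reproduction of scipy's code: it recomputes the x position of every cluster from P["leaves"] and looks each link up by its (left x, right x, height) triple. The function name match_links_by_geometry is made up for illustration, and the sketch assumes an untruncated dendrogram (no truncate_mode).

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial import distance


def match_links_by_geometry(Z, P):
    """Map each link index in P to its row in Z using coordinates only.

    Hypothetical helper; assumes the dendrogram was not truncated.
    """
    n = Z.shape[0] + 1
    # x position of every node; leaves sit at 5, 15, 25, ... in plot order
    pos = np.empty(2 * n - 1)
    for slot, leaf in enumerate(P["leaves"]):
        pos[leaf] = 5.0 + 10.0 * slot

    key_to_row = {}
    for row, (a, b, h, _) in enumerate(Z):
        xa, xb = pos[int(a)], pos[int(b)]
        pos[n + row] = (xa + xb) / 2.0  # a link is centred between its children
        key = (round(min(xa, xb), 6), round(max(xa, xb), 6), round(h, 6))
        key_to_row[key] = row

    link_to_row = {}
    for link, (xs, ys) in enumerate(zip(P["icoord"], P["dcoord"])):
        # xs[0] and xs[3] are the x positions of the two children, ys[1] the height
        key = (round(xs[0], 6), round(xs[3], 6), round(ys[1], 6))
        link_to_row[link] = key_to_row[key]
    return link_to_row


rng = np.random.RandomState(1)
data = rng.multivariate_normal([0, 0], [[5, 0], [0, 1]], 26)
Z = hierarchy.linkage(distance.pdist(data))
P = hierarchy.dendrogram(Z, no_plot=True, count_sort=True)

link_to_row = match_links_by_geometry(Z, P)
row_to_link = {row: link for link, row in link_to_row.items()}
```

Because the lookup depends only on where the links were drawn, it should not care which sort option produced the layout. The trade-off is that it relies on the 5, 15, 25, ... leaf spacing in the coordinates returned by dendrogram().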


Test code:

    dict_linkid_2_clusterid, dict_clusterid_2_linkid = get_linkid_clusterid_relation(Z)
    for linkid, _ in enumerate(zip(P["icoord"], P["dcoord"])):
        clusterid = dict_linkid_2_clusterid[linkid]
        c1, c2, _, _ = Z[clusterid]
        print(clusters[int(c1)] + clusters[int(c2)])

As you can see, this lets you fill in the unknown entries (the ? placeholders) from the output of your original code.

First, define the leaf label function.

    def llf(id):
        if id < n:
            return str(id)
        else:
            return '[%d %d %1.2f]' % (id, count, R[n-id,3])
