How to find the "x" elements closest to each centroid



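I'm working with a very high-dimensional dataset and have run k-means clustering on it. I'm trying to find the 20 points closest to each centroid. The dataset (X_emb) has dimensions 10 x 2816. Below is the code I use to find the single point closest to each centroid. The commented-out code is a potential solution I found, but I couldn't get it to work correctly.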

import numpy as np
import pickle as pkl
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min
from sklearn.neighbors import NearestNeighbors
from visualization.make_video_v2 import make_video_from_numpy
from scipy.spatial import cKDTree
n_s_train = 10000
df = pkl.load(open('cluster_data/mixed_finetuning_data.pkl', 'rb'))
N = len(df)
X = []
X_emb = []
for i in range(N):
    play = df.iloc[i]
    if df.iloc[i].label == 1:
        X_emb.append(play['embedding'])
        X.append(play['input'])

X_emb = np.array(X_emb)
kmeans = KMeans(n_clusters=10)
kmeans.fit(X_emb)
results = kmeans.cluster_centers_
# the centroids live in embedding space, so compare them against X_emb rather than X
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, X_emb)

# def find_k_closest(centroids, data, k=1, distance_norm=2):
#     kdtree = cKDTree(data, leafsize=30)
#     distances, indices = kdtree.query(centroids, k, p=distance_norm)
#     if k > 1:
#         indices = indices[:,-1]
#     values = data[indices]
#     return indices, values
# indices, values = find_k_closest(results, X_emb)

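You can use pairwise_distances to compute the distance from each centroid to every point in X_emb, then use numpy to find the indices of the 20 smallest distances, and finally pull those points out of X_emb.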

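from sklearn.metrics import pairwise_distances

centroids = kmeans.cluster_centers_  # centroids of the fitted KMeans model from the question
# distances[i, j] is the Euclidean distance from centroid i to point j
distances = pairwise_distances(centroids, X_emb, metric='euclidean')
# indices of the 20 smallest distances per centroid (not sorted by distance)
ind = [np.argpartition(i, 20)[:20] for i in distances]
closest = [X_emb[indexes] for indexes in ind]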

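closest is a list with one entry per centroid, each an array of shape (20, number of features); ind holds the corresponding (number of centroids x 20) indices.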
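One detail worth knowing: np.argpartition returns the 20 smallest entries in arbitrary order. If you want them nearest first, you can sort the selected indices by their distances afterwards. Here is a minimal sketch, assuming a distance matrix shaped like distances above; the helper name k_closest_sorted and the toy values are made up for illustration.

```python
import numpy as np

def k_closest_sorted(distances, k):
    """For each row, return the indices of the k smallest distances, nearest first."""
    distances = np.asarray(distances)
    # The first k columns now hold the k smallest entries of each row, unordered.
    part = np.argpartition(distances, k - 1, axis=1)[:, :k]
    # Look up the distances of those k candidates and sort them.
    part_dist = np.take_along_axis(distances, part, axis=1)
    order = np.argsort(part_dist, axis=1)
    return np.take_along_axis(part, order, axis=1)

# Toy distance matrix: 2 centroids x 5 points (hypothetical values).
toy = np.array([[0.9, 0.1, 0.5, 0.3, 0.7],
                [0.2, 0.8, 0.4, 0.6, 0.0]])
top3 = k_closest_sorted(toy, 3)
```

With the real data this would be called as k_closest_sorted(distances, 20), and X_emb[row] for each row of the result gives the points themselves.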

You can also do this with scikit-learn's NearestNeighbors class:

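from sklearn.neighbors import NearestNeighbors

def find_k_closest(centroids, data):
    nns = {}
    neighbors = NearestNeighbors(n_neighbors=20).fit(data)
    for i, center in enumerate(centroids):
        # kneighbors expects a 2D array, so reshape the centroid to (1, n_features)
        nns[i] = neighbors.kneighbors(center.reshape(1, -1), return_distance=False)[0]
    return nns

The nns dictionary maps each centroid's index to an array holding the indices of its 20 nearest neighbors in data. (NumPy arrays are not hashable, so the centroid itself cannot serve as a dictionary key.)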
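kneighbors also accepts all the centroids in a single call, which avoids the loop and returns the neighbors already sorted by distance. The following is only a sketch on random data; the helper name k_nearest_per_centroid and the synthetic data and centroids are assumptions for illustration, not part of the question's dataset.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_nearest_per_centroid(centroids, data, k=20):
    """Return (distances, indices), each of shape (n_centroids, k), nearest first."""
    nn = NearestNeighbors(n_neighbors=k).fit(data)
    # One batched query: row i of the result belongs to centroid i.
    distances, indices = nn.kneighbors(centroids)
    return distances, indices

# Synthetic stand-in for X_emb and the k-means centroids (hypothetical values).
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 8))
centroids = data[:3].copy()  # pretend the first three points are centroids

distances, indices = k_nearest_per_centroid(centroids, data, k=5)
nearest_points = data[indices]  # shape (n_centroids, k, n_features)
```

On the real data you would pass kmeans.cluster_centers_ and X_emb with k=20.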
