How to find the K nearest neighbours from a DataFrame of vectors



I have a DataFrame called neighbours_lookup with an ID column and a column of normalized data ('vec'), stored as arrays:

id  vec
0   857827315   [-0.5345224838248487, -0.5345224838248487, 1.8...
1   857827311   [-0.3535533905932738, -0.3535533905932738, 2.8...
2   857827316   [-0.3535533905932738, -0.3535533905932738, -0....
3   857827312   [-0.5345224838248487, 1.8708286933869707, -0.5...
4   857827313   [-0.35355339059327373, -0.35355339059327373, -...

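I want to write a function where I can pass in an ID and get back its 10 nearest neighbours.

I looked at sklearn.neighbors, which seems relevant, but I can't work out how to use it. I tried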

knn = NearestNeighbors(n_neighbors=10,
                       algorithm='auto')
for row in neighbours_lookup['vec']:
    knn.fit(row.reshape(1, -1))

The error I get is

AttributeError: 'list' object has no attribute 'reshape'

Can someone explain where I should go from here? My DataFrame will have more than 100,000 rows, so it needs to be fast.

---Edit---

Thanks to 达斯爸爸 and some tinkering of my own, I got it working! The function is below.

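from sklearn.neighbors import NearestNeighbors

def get_k_neighbours(isbn, df, number_of_neighbours):
    def get_knn(df):
        # Turn the 'vec' column into a list of vectors and fit the model on it
        vector_arrays = df['vec'].to_numpy().tolist()
        return NearestNeighbors().fit(vector_arrays)

    def get_vector(df, isbn):
        # Vector of the first row matching isbn, reshaped to (1, n_features);
        # this assumes 'vec' holds NumPy arrays, since lists have no reshape
        return df.loc[df['isbn'] == isbn, 'vec'].iloc[0].reshape(1, -1)

    def flatten_neighbour_list(nb_indexes):
        # kneighbors returns a 2-D array; flatten it into a plain list
        nb_list = nb_indexes.tolist()
        return [item for sublist in nb_list for item in sublist]

    knn = get_knn(df)
    vector = get_vector(df, isbn)
    nb_indexes = knn.kneighbors(vector, number_of_neighbours, return_distance=False)
    nb_indexes = flatten_neighbour_list(nb_indexes)
    return nb_indexes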
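Since the function above refits the model on every call, repeated lookups on a large DataFrame may be slow. Below is a minimal sketch of fitting once and reusing the fitted model. The class name NeighbourIndex, its parameters, and the demo data are all hypothetical and only there for illustration; the column names 'isbn' and 'vec' are taken from the function above.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors


class NeighbourIndex:
    """Hypothetical helper: fit NearestNeighbors once, then answer many lookups."""

    def __init__(self, df, id_col="isbn", vec_col="vec"):
        # Build one (n_samples, n_features) matrix from the per-row vectors.
        self.matrix = np.array(df[vec_col].tolist(), dtype=float)
        self.ids = df[id_col].to_numpy()
        self.knn = NearestNeighbors().fit(self.matrix)

    def neighbours(self, item_id, k):
        # Position of the first row whose ID matches item_id.
        pos = int(np.flatnonzero(self.ids == item_id)[0])
        query = self.matrix[pos].reshape(1, -1)
        idx = self.knn.kneighbors(query, n_neighbors=k, return_distance=False)
        # Row positions in df, nearest first (the query row itself is included).
        return idx.ravel().tolist()


# Made-up demo data, not taken from the question.
demo = pd.DataFrame({
    "isbn": [11, 12, 13, 14],
    "vec": [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.2]],
})
index = NeighbourIndex(demo)
print(index.neighbours(13, 2))
```

The fitting cost is paid once in the constructor, so each later call only runs the neighbour query.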

A NumPy ndarray has a reshape attribute but a Python list does not, hence the AttributeError. You can fit a list of lists of shape (n_samples, n_features) to NearestNeighbors.

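from sklearn.neighbors import NearestNeighbors

# Fit once on a list of lists of shape (n_samples, n_features)
knn = NearestNeighbors(n_neighbors=10, algorithm='auto')
knn.fit(neighbours_lookup['vec'].tolist())

def get_neighbors(id):
    # Look up the vector by the 'id' column, not by the index label
    vector = neighbours_lookup.loc[neighbours_lookup['id'] == id, 'vec'].iloc[0]
    # Returns row positions, nearest first (the query row itself is included)
    return knn.kneighbors([vector], 10, return_distance=False)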
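kneighbors returns row positions rather than IDs, and the query row is normally its own nearest neighbour. The sketch below shows one way to map the positions back to the 'id' column and drop the query row. The function name nearest_ids and the small sample DataFrame are assumptions made for illustration, not part of the original data.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Tiny stand-in for neighbours_lookup; the values are made up for illustration.
neighbours_lookup = pd.DataFrame({
    "id": [101, 102, 103, 104, 105],
    "vec": [[0.0, 0.0], [0.0, 1.0], [0.0, 3.0], [10.0, 10.0], [10.0, 12.0]],
})

# Fit once on an (n_samples, n_features) matrix.
matrix = np.array(neighbours_lookup["vec"].tolist(), dtype=float)
ids = neighbours_lookup["id"].to_numpy()
knn = NearestNeighbors(algorithm="auto").fit(matrix)


def nearest_ids(query_id, k=10):
    """Return up to k (id, distance) pairs closest to query_id, excluding itself."""
    pos = int(np.flatnonzero(ids == query_id)[0])
    # Ask for one extra neighbour because the query row is in the fitted data.
    n = min(k + 1, len(ids))
    dist, idx = knn.kneighbors(matrix[pos].reshape(1, -1), n_neighbors=n)
    pairs = [(int(ids[i]), float(d)) for i, d in zip(idx[0], dist[0]) if i != pos]
    return pairs[:k]


result = nearest_ids(101, k=2)
print(result)
```

Filtering on the row position, rather than assuming the first result is the query, keeps the function correct when another row has an identical vector.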
