Creating lists of the items nearest to each centre

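I'm currently working on a project that uses k-means on a large dataset. I want to stretch my brain and implement it without any external libraries, using only functions I write myself. I've got quite far, but I've hit a problem: I can't build the lists of points according to which cluster centre each point is nearest to.

For convenience, I've created a small subset of the data below to work with, rather than using the whole dataset I have: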

dataset1 = [(6.08804, 3.457729), (4.147974, 5.275341), (6.538759, 3.670323), 
(4.579573, 4.03559), (4.756026, 4.184762), (5.221742, 2.872705)]
cluster_1 = (0, 1)
cluster_2 = (1, 2)
clusters = [cluster_1, cluster_2] # although clusters not near data, it is to practise my model

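Below are the 3 functions involved in assigning points to cluster centres.

Calculate the distance between the data and the cluster centres, where each point in the dataset is compared with each point in the cluster list: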

def calculate_distance(point1, point2):
    distance = 0
    for i in range(len(point1)):
        # Euclidean distance formula: sum of squared coordinate differences
        distance += (point1[i] - point2[i])**2
    # result then square rooted for distance
    return distance**0.5
# end of function
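As a quick sanity check on calculate_distance, the sketch below repeats the function and evaluates it on a 3-4-5 right triangle, then compares the first sample point against both centres. The test values are illustrative and not part of the original post.

```python
def calculate_distance(point1, point2):
    """Euclidean distance between two equal-length coordinate tuples."""
    distance = 0
    for i in range(len(point1)):
        distance += (point1[i] - point2[i]) ** 2
    return distance ** 0.5


# 3-4-5 right triangle: the distance from the origin to (3, 4) is 5
print(calculate_distance((0, 0), (3, 4)))  # 5.0

# Compare the first sample point against both centres from the question
point = (6.08804, 3.457729)
to_cluster_1 = calculate_distance(point, (0, 1))
to_cluster_2 = calculate_distance(point, (1, 2))
print(to_cluster_2 < to_cluster_1)  # True
```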

Determine which cluster centre a point is closest to:

def find_nearest_centre(dataset, clusters):
    nearest_point = []
    min_distance = 100000
    # obtaining sample from cluster list
    for c in clusters:
        # using distance formula above to calculate distance between points
        distance = calculate_distance(c, dataset)
        if distance < min_distance:
            min_distance = distance
            nearest_point.append(min_distance)

    return nearest_point

Create two lists, one per cluster, containing the coordinates of the data belonging to that cluster:

def create_list(dataset1, clusters):
    # new lists created for 2 clusters
    list_1 = []
    list_2 = []
    for d in dataset1:
        # using nearest_centre formula to determine which points are closest to centres
        nearest_centre = find_nearest_centre(d, clusters)
        # adding closest coordinates to list_1 for cluster 1 and list_2 for cluster 2
        if nearest_centre == clusters[0]:
            list_1.append(d)
        elif nearest_centre == clusters[1]:
            list_2.append(d)

    return list_1, list_2

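Now to my problem. When I run the create_list function, it only creates two empty lists and does not append any of the coordinates as I expected. Although unrealistic, if the first 3 values were in the first cluster and the last 3 values were closest to the second cluster, the desired output would be: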

create_list(dataset1, clusters) # this is only function needed to operate ideally
list_1 = [(6.08804, 3.457729), (4.147974, 5.275341), (6.538759, 3.670323)] # list of tuples output
list_2 = [(4.579573, 4.03559), (4.756026, 4.184762), (5.221742, 2.872705)] # list of tuples output

I'd appreciate any help I can get, obviously sticking to the theme of not using external packages. Thanks!

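You get empty lists because you are comparing a cluster centre with what find_nearest_centre returns, which is a list of distances rather than a centre, so no match is possible.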
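To make the mismatch concrete, here is a minimal sketch. The distances in returned_value are illustrative stand-ins for what the original function hands back, not values taken from the post.

```python
clusters = [(0, 1), (1, 2)]

# Hypothetical stand-in for the original return value:
# a list of distances (floats), not a centre tuple.
returned_value = [6.57, 5.29]

# A list of floats never compares equal to a coordinate tuple,
# so neither the if branch nor the elif branch can fire.
matches = [returned_value == c for c in clusters]
print(matches)  # [False, False]
```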

Return the nearest cluster, rather than the point, from find_nearest_centre:

def find_nearest_centre(dataset, clusters):
    min_distance = float("inf")
    # obtaining sample from cluster list
    for c in clusters:
        # using distance formula above to calculate distance between points
        distance = calculate_distance(c, dataset)
        if distance < min_distance:
            min_distance = distance
            nearest_cluster = c
    return nearest_cluster

Then compare cluster with cluster:

def create_list(dataset1, clusters):
    # new lists created for 2 clusters
    list_1 = []
    list_2 = []
    for d in dataset1:
        # using nearest_centre formula to determine which points are closest to centres
        nearest_cluster = find_nearest_centre(d, clusters)
        # adding closest coordinates to list_1 for cluster 1 and list_2 for cluster 2
        if nearest_cluster == clusters[0]:
            list_1.append(d)
        elif nearest_cluster == clusters[1]:
            list_2.append(d)
        else:
            print("No match")
    return list_1, list_2

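The output is not what you expected, but just from looking at the data, I think cluster_2 should always be the closer one in this case: every point lies far up and to the right of both centres, and (1, 2) is nearer to them than (0, 1).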

list_1 = []
list_2 = [(6.08804, 3.457729), (4.147974, 5.275341), (6.538759, 3.670323), (4.579573, 4.03559), (4.756026, 4.184762), (5.221742, 2.872705)]
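The same idea extends beyond two clusters. The sketch below is one possible generalisation, not part of the original answer. It uses a hypothetical helper, assign_to_clusters, which picks the index of the nearest centre with the built-in min and appends each point to the list at that index. This avoids the tuple equality check and the hard-coded list_1 and list_2, and still uses no external packages.

```python
def calculate_distance(point1, point2):
    """Euclidean distance between two equal-length coordinate tuples."""
    return sum((a - b) ** 2 for a, b in zip(point1, point2)) ** 0.5


def assign_to_clusters(dataset, clusters):
    """Return one list of points per centre, in the same order as clusters.

    assign_to_clusters is a hypothetical name used for illustration.
    Ties go to the earliest centre, because min returns the first minimum.
    """
    groups = [[] for _ in clusters]
    for point in dataset:
        nearest_index = min(
            range(len(clusters)),
            key=lambda i: calculate_distance(point, clusters[i]),
        )
        groups[nearest_index].append(point)
    return groups


dataset1 = [(6.08804, 3.457729), (4.147974, 5.275341), (6.538759, 3.670323),
            (4.579573, 4.03559), (4.756026, 4.184762), (5.221742, 2.872705)]
clusters = [(0, 1), (1, 2)]

list_1, list_2 = assign_to_clusters(dataset1, clusters)
print(list_1)       # []
print(len(list_2))  # 6
```

With the sample centres every point lands in the second list, matching the output above. Adding a third centre to clusters needs no change to the function.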
