Creating lists of items by nearest centre



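I'm currently working on a project that uses k-means on a large dataset. I want to stretch my brain and implement it without any external libraries, using only functions I write myself. I've got fairly far, but I've hit a problem: I can't manage to build the lists of points based on where the cluster centres are.

For convenience, I've created a small subset of the data below to work with, rather than using the whole dataset I have.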

dataset1 = [(6.08804, 3.457729), (4.147974, 5.275341), (6.538759, 3.670323), 
(4.579573, 4.03559), (4.756026, 4.184762), (5.221742, 2.872705)]
cluster_1 = (0, 1)
cluster_2 = (1, 2)
clusters = [cluster_1, cluster_2]  # the centres are not near the data; this is only to practise my model

Below are the 3 functions involved in assigning points to cluster centres:

  1. Calculate the distance between the data and the cluster centres, where each point in dataset is compared with each point in cluster_list
def calculate_distance(point1, point2):
    distance = 0
    for i in range(len(point1)):
        # Euclidean distance formula: sum of squared coordinate differences
        distance += (point1[i] - point2[i])**2
    # result then square rooted for distance
    return distance**0.5
# end of function
  2. Determine which cluster centre a point is nearest to
def find_nearest_centre(dataset1, clusters):
    nearest_point = []
    min_distance = 100000
    # obtaining sample from cluster list
    for c in clusters:
        # using distance formula above to calculate distance between points
        distance = calculate_distance(c, dataset)
        if distance < min_distance:
            min_distance = distance
            nearest_point.append(min_distance)

    return nearest_point
  3. Create two lists, one per cluster, containing the coordinates of the data belonging to that cluster
def create_list(dataset1, clusters):
    # new lists created for 2 clusters
    list_1 = []
    list_2 = []
    for d in dataset1:
        # using nearest_centre formula to determine which points are closest to centres
        nearest_centre = find_nearest_centre(d, clusters)
        # adding closest coordinates to list_1 for cluster 1 and list_2 for cluster 2
        if nearest_centre == clusters[0]:
            list_1.append(d)
        elif nearest_centre == clusters[1]:
            list_2.append(d)

    return list_1, list_2
Now for my problem. When I run the create_list function, it only creates two empty lists and does not append each coordinate as expected. Although unrealistic, if the first 3 values were in the first cluster and the last 3 values were nearest the second cluster, the desired output would be:
create_list(dataset1, clusters) # this is only function needed to operate ideally
list_1 = [(6.08804, 3.457729), (4.147974, 5.275341), (6.538759, 3.670323)] # list of tuples output
list_2 = [(4.579573, 4.03559), (4.756026, 4.184762), (5.221742, 2.872705)] # list of tuples output

I'd appreciate any help I can get, obviously sticking to the theme of not using external packages. Thanks!

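You are getting empty lists because you compare each cluster centre with the value find_nearest_centre returns for a point, which is a list of distances rather than a centre, so no match is possible.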
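To see why neither branch ever runs, here is a minimal sketch that mimics the original find_nearest_centre for a single point. The variable name `returned` is only illustrative, and the distance helper is a compact rewrite of calculate_distance, not the asker's exact code.

```python
def calculate_distance(point1, point2):
    # Euclidean distance, written compactly for this demonstration.
    return sum((a - b) ** 2 for a, b in zip(point1, point2)) ** 0.5


clusters = [(0, 1), (1, 2)]
point = (6.08804, 3.457729)

# Mimics the original function: it collects distances, not centres.
returned = []
min_distance = 100000
for c in clusters:
    distance = calculate_distance(c, point)
    if distance < min_distance:
        min_distance = distance
        returned.append(min_distance)

print(returned)                 # a list of floats
print(returned == clusters[0])  # False: a list never equals a tuple
print(returned == clusters[1])  # False
```

Since both comparisons are False for every point, create_list never appends anything.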

Return the nearest cluster, rather than the distance, from find_nearest_centre:

def find_nearest_centre(dataset, clusters):
    min_distance = float("inf")
    nearest_cluster = None
    # obtaining sample from cluster list
    for c in clusters:
        # using distance formula above to calculate distance between points
        distance = calculate_distance(c, dataset)
        if distance < min_distance:
            min_distance = distance
            nearest_cluster = c
    return nearest_cluster

and then compare cluster with cluster in create_list:

def create_list(dataset1, clusters):
    # new lists created for 2 clusters
    list_1 = []
    list_2 = []
    for d in dataset1:
        # using nearest_centre formula to determine which points are closest to centres
        nearest_cluster = find_nearest_centre(d, clusters)
        # adding closest coordinates to list_1 for cluster 1 and list_2 for cluster 2
        if nearest_cluster == clusters[0]:
            list_1.append(d)
        elif nearest_cluster == clusters[1]:
            list_2.append(d)
        else:
            print("No match")
    return list_1, list_2

The output is not what you expected, but just from looking at the data, cluster_2 is always the closer centre in this case.

list_1 = []
list_2 = [(6.08804, 3.457729), (4.147974, 5.275341), (6.538759, 3.670323), (4.579573, 4.03559), (4.756026, 4.184762), (5.221742, 2.872705)]
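If you later want more than two clusters, one possible generalisation is to pick the index of the nearest centre and keep one list per centre. This is only a sketch, still without external packages; the name assign_to_clusters is hypothetical and not part of the code above.

```python
def calculate_distance(point1, point2):
    """Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(point1, point2)) ** 0.5


def assign_to_clusters(dataset, clusters):
    """Return one list of points per cluster centre, in the order of `clusters`."""
    groups = [[] for _ in clusters]
    for point in dataset:
        # Index of the centre with the smallest distance to this point.
        nearest_index = min(
            range(len(clusters)),
            key=lambda i: calculate_distance(point, clusters[i]),
        )
        groups[nearest_index].append(point)
    return groups


dataset1 = [(6.08804, 3.457729), (4.147974, 5.275341), (6.538759, 3.670323),
            (4.579573, 4.03559), (4.756026, 4.184762), (5.221742, 2.872705)]
clusters = [(0, 1), (1, 2)]

list_1, list_2 = assign_to_clusters(dataset1, clusters)
print(len(list_1), len(list_2))  # 0 6
```

Working with indices instead of comparing tuples also avoids ambiguity if two centres happen to have identical coordinates.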
