I want to perform unsupervised learning on the following list and use that knowledge to predict a value for each item in a test list.
#Format [real_runtime, processors, requested_time, score, more_to_be_added]
#some entries from the list
Training dataset
Xsrc = [['354', '2048', '3600', '53.0521472395'],
['605', '2048', '600', '54.8768871369'],
['128', '2048', '600', '51.0'],
['136', '2048', '900', '51.0000000563'],
['19218', '480', '21600', '51.0'],
['15884', '2048', '18000', '51.0'],
['118', '2048', '1500', '51.0'],
['103', '2048', '2100', '51.0000002839'],
['18542', '480', '21600', '51.0000000001'],
['13272', '2048', '18000', '51.0000000001']]
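Test dataset
Using the clusters, I want to predict real_runtime for a new list (a value of '-1' marks an unknown entry):
Xtest = [['-1', '2048', '1500', '51.0000000161'],
['-1', '2048', '10800', '51.0000000002'],
['-1', '512', '21600', '-1'],
['-1', '512', '2700', '51.0000000004'],
['-1', '1024', '21600', '51.1042617556']]
Code: formatting the list, building the clusters, and plotting them in Python with scikit-learn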
from sklearn.feature_selection import VarianceThreshold
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
##Training dataset
Xsrc = [['354', '2048', '3600', '53.0521472395'],
['605', '2048', '600', '54.8768871369'],
['128', '2048', '600', '51.0'],
['136', '2048', '900', '51.0000000563'],
['19218', '480', '21600', '51.0'],
['15884', '2048', '18000', '51.0'],
['118', '2048', '1500', '51.0'],
['103', '2048', '2100', '51.0000002839'],
['18542', '480', '21600', '51.0000000001'],
['13272', '2048', '18000', '51.0000000001']]
print("Xsrc:", Xsrc)
##Test data set
Xtest= [['1224', '2048', '1500', '51.0000000161'],
['7867', '2048', '10800', '51.0000000002'],
['21594', '512', '21600', '-1'],
['1760', '512', '2700', '51.0000000004'],
['115', '1024', '21600', '51.1042617556']]
##Clustering
# The rows hold numeric strings, so convert them to floats before scaling
X = StandardScaler().fit_transform(np.array(Xsrc, dtype=float))
db = DBSCAN(min_samples=2).fit(X) #no clustering parameter, such as default eps
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
clusters = [X[labels == i] for i in range(n_clusters_)]
print('Estimated number of clusters: %d' % n_clusters_)
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, labels))
##Plotting the dataset
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'
    class_member_mask = (labels == k)
    # Core samples are drawn with large markers
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=20)
    # Non-core (border and noise) samples are drawn with small markers
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=10)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
Any ideas on how I can use the clusters to predict the values?
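Clustering is not prediction
"Predicting" a cluster label is of little use, because the label was assigned more or less "randomly" by the clustering algorithm.
Worse: most clustering algorithms cannot incorporate new data.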
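To illustrate the point about new data: in scikit-learn, DBSCAN has no predict method, while KMeans does. A rough workaround, sketched below, is to give each new point the label of its nearest core sample when that sample lies within eps, and to call it noise otherwise. The helper name assign_to_dbscan_clusters and the toy points are made up for this illustration, and the rule itself is an assumption, not something DBSCAN defines.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans


def assign_to_dbscan_clusters(db, X_train, X_new):
    """Label each new point with the cluster of its nearest core sample,
    or -1 (noise) if that core sample is farther away than db.eps.

    This is a heuristic for illustration; it does not re-run the clustering.
    """
    core_points = X_train[db.core_sample_indices_]
    core_labels = db.labels_[db.core_sample_indices_]
    labels = np.full(len(X_new), -1, dtype=int)
    if len(core_points) == 0:
        return labels  # no core samples at all: everything is noise
    for i, point in enumerate(X_new):
        distances = np.linalg.norm(core_points - point, axis=1)
        nearest = np.argmin(distances)
        if distances[nearest] <= db.eps:
            labels[i] = core_labels[nearest]
    return labels


# Toy data (hypothetical): two tight groups of three points each
X_train = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
db = DBSCAN(eps=0.5, min_samples=2).fit(X_train)

# Two points near the groups and one far from both
X_new = np.array([[0.05, 0.05], [5.05, 5.05], [2.5, 2.5]])
new_labels = assign_to_dbscan_clusters(db, X_train, X_new)

print(hasattr(db, "predict"), hasattr(KMeans(n_clusters=2), "predict"))
print(new_labels)
```

Even with such a rule, the label only says which group a point resembles. It says nothing about real_runtime by itself.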
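You really should use clustering to explore your data and to learn what is there and what is not. Don't expect the clusters to be "good".
Sometimes people succeed in quantizing the data set into k centers and then use only this "compressed" data set for classification/prediction (usually based on the nearest neighbor only). I have also seen the idea of training one regression per cluster for prediction, and using nearest neighbors to choose which regressor to apply (i.e., if a data point fits a cluster well, use that cluster's regression model).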
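Both ideas can be sketched on the rows from the question. This is only a minimal illustration under several assumptions that are not in the original post: KMeans stands in for the quantization step, k is set to 2, the features are standardized, a cluster needs at least MIN_ROWS rows before it gets its own LinearRegression, and the test row with an unknown score ('-1') is left out. With only ten training rows the numbers it produces should not be trusted.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Training rows from the question: [real_runtime, processors, requested_time, score]
train = np.array([
    [354, 2048, 3600, 53.0521472395],
    [605, 2048, 600, 54.8768871369],
    [128, 2048, 600, 51.0],
    [136, 2048, 900, 51.0000000563],
    [19218, 480, 21600, 51.0],
    [15884, 2048, 18000, 51.0],
    [118, 2048, 1500, 51.0],
    [103, 2048, 2100, 51.0000002839],
    [18542, 480, 21600, 51.0000000001],
    [13272, 2048, 18000, 51.0000000001],
], dtype=float)
y_train = train[:, 0]   # target: real_runtime
X_train = train[:, 1:]  # features: processors, requested_time, score

# Test rows (features only); the row with an unknown score is omitted
X_test = np.array([
    [2048, 1500, 51.0000000161],
    [2048, 10800, 51.0000000002],
    [512, 2700, 51.0000000004],
    [1024, 21600, 51.1042617556],
], dtype=float)

scaler = StandardScaler().fit(X_train)
Z_train = scaler.transform(X_train)
Z_test = scaler.transform(X_test)

K = 2         # assumed number of centers
MIN_ROWS = 3  # assumed minimum cluster size for fitting a regression
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(Z_train)
test_clusters = km.predict(Z_test)  # nearest center for each test row

# Idea 1: quantize to K centers, predict the mean runtime of the nearest center
center_runtime = np.array([y_train[km.labels_ == c].mean() for c in range(K)])
pred_quantized = center_runtime[test_clusters]

# Idea 2: one regression per cluster, selected by the nearest center
models = {}
for c in range(K):
    members = km.labels_ == c
    if members.sum() >= MIN_ROWS:
        models[c] = LinearRegression().fit(Z_train[members], y_train[members])

pred_regression = np.array([
    models[int(c)].predict(z.reshape(1, -1))[0] if int(c) in models
    else center_runtime[c]  # fall back to the cluster mean
    for c, z in zip(test_clusters, Z_test)
])

print("nearest-center mean runtime:", pred_quantized)
print("per-cluster regression:     ", pred_regression)
```

In both variants the clustering only decides which local model is applied. The prediction comes from the runtimes attached to the training rows, which makes this a supervised step built on top of the clusters.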