KMeans: terms that appear in multiple clusters

When using KMeans with a TF-IDF vectorizer, is it possible to get terms that appear in more than one cluster?

Here is the sample dataset:

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

I use a TF-IDF vectorizer for feature extraction:

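from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
feature = vectorizer.fit_transform(documents)
# Vocabulary terms, indexed by column (scikit-learn >= 1.0; older releases use get_feature_names())
terms = vectorizer.get_feature_names_out()
true_k = 3
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
km.fit(feature)
# For each centroid, term indices sorted by descending weight
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
print("Top terms per cluster:")
for i in range(true_k):
    print("Cluster %d:" % i, end=' ')
    for ind in order_centroids[i, :10]:
        print(' %s,' % terms[ind], end=' ')
    print()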

When I cluster the documents with KMeans from scikit-learn, the result is:

Top terms per cluster:
Cluster 0:  user,  eps,  interface,  human,  response,  time,  computer,  management,  engineering,  testing,
Cluster 1:  trees,  intersection,  paths,  random,  generation,  unordered,  binary,  graph,  interface,  human,
Cluster 2:  minors,  graph,  survey,  widths,  ordering,  quasi,  iv,  trees,  engineering,  eps,

We can see that some terms appear in more than one cluster (e.g., graph in clusters 1 and 2, eps in clusters 0 and 2).

Is the clustering result wrong? Or is it acceptable because the tf-idf scores of these terms differ from document to document?

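I think you are a little confused about what you are trying to do. The code you are using gives you a clustering of the documents, not of the terms. The terms are the dimensions of the clustering, so every centroid has a coordinate for every term, and the same term can rank among the top terms of more than one centroid.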

If you want to find out which cluster each document belongs to, you just need to use the predict or fit_predict method, like this:

vectorizer = TfidfVectorizer(stop_words='english')
feature = vectorizer.fit_transform(documents)
true_k = 3
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
km.fit(feature)
for n in range(len(documents)):
    # predict returns an array with one label per row, so take its first element
    print("Doc %d belongs to cluster %d. " % (n, km.predict(feature[n])[0]))

This gives:

Doc 0 belongs to cluster 2. 
Doc 1 belongs to cluster 1. 
Doc 2 belongs to cluster 2. 
Doc 3 belongs to cluster 2. 
Doc 4 belongs to cluster 1. 
Doc 5 belongs to cluster 0. 
Doc 6 belongs to cluster 0. 
Doc 7 belongs to cluster 0. 
Doc 8 belongs to cluster 1.
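The same assignment can be obtained in one call with fit_predict, which fits the model and returns one label per document. The sketch below groups the documents by label; `docs_by_cluster` and `random_state=0` are names and settings I introduce for illustration, and since cluster numbering depends on the initialization, the labels may not match the output above.

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

feature = TfidfVectorizer(stop_words="english").fit_transform(documents)

# random_state is an assumption added here so that repeated runs agree.
km = KMeans(n_clusters=3, init="k-means++", max_iter=100, n_init=1,
            random_state=0)
labels = km.fit_predict(feature)  # one cluster index per document

# Collect the documents assigned to each cluster.
docs_by_cluster = defaultdict(list)
for doc, label in zip(documents, labels):
    docs_by_cluster[int(label)].append(doc)

for label in sorted(docs_by_cluster):
    print("Cluster %d:" % label)
    for doc in docs_by_cluster[label]:
        print("    " + doc)
```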

See the scikit-learn user guide.
