Sklearn: Finding the mean centroid location of a cluster

import pandas as pd, numpy as np, scipy
import sklearn.feature_extraction.text as text
from sklearn import decomposition
descs = ["You should not go there", "We may go home later", "Why should we do your chores", "What should we do"]
vectorizer = text.CountVectorizer()
dtm = vectorizer.fit_transform(descs).toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocab = np.array(vectorizer.get_feature_names_out())
nmf = decomposition.NMF(3, random_state = 1)
topic = nmf.fit_transform(dtm)

Printing topic gives me:

>>> print(topic)
[0.       , 1.403    , 0.     ],
[0.       , 0.       , 1.637  ],
[1.257    , 0.       , 0.     ],
[0.874    , 0.056    , 0.065  ]

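These are vectors giving the likelihood that each element of descs belongs to a given cluster. How can I get the coordinates of each cluster's centroid? Ultimately, I want to write a function that computes the distance between each element of descs and the centroid of the cluster it was assigned to.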

Would it be best to simply compute, for each cluster, the average of the topic values of each descs element?

The documentation for sklearn.decomposition.NMF explains how to get the coordinates of each cluster's centroid:

Attributes: components_ : array, [n_components, n_features]
    Non-negative components of the data.

The basis vectors are arranged row-wise, as the following interactive session shows:

In [995]: np.set_printoptions(precision=2)
In [996]: nmf.components_
Out[996]: 
array([[ 0.54,  0.91,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.89,  0.  ,  0.89,  0.37,  0.54,  0.  ,  0.54],
       [ 0.  ,  0.01,  0.71,  0.  ,  0.  ,  0.  ,  0.71,  0.72,  0.71,  0.01,  0.02,  0.  ,  0.71,  0.  ],
       [ 0.  ,  0.01,  0.61,  0.61,  0.61,  0.61,  0.  ,  0.  ,  0.  ,  0.62,  0.02,  0.  ,  0.  ,  0.  ]])
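
To read those rows in terms of words, one option is to pair each row with the vocabulary. The sketch below is not part of the original session. It copies the rounded values printed above, assumes the columns follow the alphabetical vocabulary order that CountVectorizer produces for these sentences, and uses a hypothetical helper, words_above, with an arbitrary threshold of 0.5.

```python
import numpy as np

# Vocabulary in the alphabetical column order assumed for CountVectorizer.
vocab = np.array(["chores", "do", "go", "home", "later", "may", "not",
                  "should", "there", "we", "what", "why", "you", "your"])

# Rows of nmf.components_, copied (rounded) from the session output above.
components = np.array([
    [0.54, 0.91, 0.00, 0.00, 0.00, 0.00, 0.00, 0.89, 0.00, 0.89, 0.37, 0.54, 0.00, 0.54],
    [0.00, 0.01, 0.71, 0.00, 0.00, 0.00, 0.71, 0.72, 0.71, 0.01, 0.02, 0.00, 0.71, 0.00],
    [0.00, 0.01, 0.61, 0.61, 0.61, 0.61, 0.00, 0.00, 0.00, 0.62, 0.02, 0.00, 0.00, 0.00],
])

def words_above(components, vocab, threshold=0.5):
    """Hypothetical helper: for each component row, list the vocabulary
    words whose weight exceeds the threshold."""
    return [vocab[row > threshold].tolist() for row in components]

words = words_above(components, vocab)
for i, row_words in enumerate(words):
    print(i, row_words)
```

With these values, each row picks out the words of one or two of the original sentences, which is consistent with reading the rows as cluster "centroids" in word-count space.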

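As for your second question, I don't see the point of "calculating the average of each descs element's topic values for each cluster". In my opinion, it makes more sense to classify using the computed likelihoods.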
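
A minimal sketch of that idea, under some assumptions: each element of descs is assigned to the component with the largest value in its topic row, and the "distance to the centroid" is taken to be the Euclidean distance between the document's count vector and the assigned row of nmf.components_. The helper name distance_to_assigned_component and the choice of Euclidean distance are mine, not something the NMF documentation prescribes.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

descs = ["You should not go there", "We may go home later",
         "Why should we do your chores", "What should we do"]

# Same pipeline as in the question.
dtm = CountVectorizer().fit_transform(descs).toarray()
nmf = NMF(n_components=3, random_state=1)
topic = nmf.fit_transform(dtm)

def distance_to_assigned_component(dtm, topic, components):
    """Hypothetical helper: assign each document to the component with the
    largest weight in its topic row, then return (labels, distances), where
    distances[i] is the Euclidean distance between dtm[i] and that row."""
    labels = np.argmax(topic, axis=1)      # index of the strongest component
    assigned = components[labels]          # shape (n_samples, n_features)
    distances = np.linalg.norm(dtm - assigned, axis=1)
    return labels, distances

labels, distances = distance_to_assigned_component(dtm, topic, nmf.components_)
for desc, label, dist in zip(descs, labels, distances):
    print(f"{desc!r}: component {label}, distance {dist:.2f}")
```

Note that the rows of components_ are not on the same scale as the count vectors (the topic weights absorb the scaling), so a cosine distance, or a comparison against the row multiplied by its topic weight, may be a better fit depending on what the distance is used for.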
