Document clustering with mean shift

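I took a collection of documents and computed the tf*idf score of every token across all of them. I can't figure out how to use sklearn.cluster.MeanShift to create clusters from the resulting vectors.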

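TfidfVectorizer converts the documents into a "sparse matrix" of numbers. MeanShift requires the data passed to it to be "dense". Below, I show how to do the conversion inside a pipeline (credit), but, memory permitting, you could simply convert the sparse matrix to a dense one with toarray() or todense(). Of the two, toarray() is the safer choice: it returns a plain NumPy array, whereas todense() returns a numpy.matrix, which newer scikit-learn releases may reject.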


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MeanShift
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

documents = ['this is document one',
             'this is document two',
             'document one is fun',
             'document two is mean',
             'document is really short',
             'how fun is document one?',
             'mean shift... what is that']

pipeline = Pipeline(
    steps=[
        # Turn each document into a sparse row of tf-idf weights.
        ('tfidf', TfidfVectorizer()),
        # MeanShift needs dense input; toarray() returns a plain ndarray.
        ('trans', FunctionTransformer(lambda x: x.toarray(), accept_sparse=True)),
        ('clust', MeanShift())
    ])
pipeline.fit(documents)

# One cluster label per document, in the same order as `documents`.
labels = pipeline.named_steps['clust'].labels_
result = [(label, doc) for doc, label in zip(documents, labels)]
for label, doc in sorted(result):
    print(label, doc)

This prints:

0 document two is mean
0 this is document one
0 this is document two
1 document one is fun
1 how fun is document one?
2 mean shift... what is that
3 document is really short
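If the data fit in memory, the FunctionTransformer step can be skipped and the sparse matrix converted directly, as mentioned earlier. Here is a minimal sketch of that variant, assuming the same toy documents; the variable names tfidf, dense and labels are just ones I picked for illustration.

```python
from sklearn.cluster import MeanShift
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ['this is document one',
             'this is document two',
             'document one is fun',
             'document two is mean',
             'document is really short',
             'how fun is document one?',
             'mean shift... what is that']

# fit_transform returns a SciPy sparse matrix with one row per document.
tfidf = TfidfVectorizer().fit_transform(documents)

# toarray() materialises every zero, so this only suits corpora whose
# (documents x vocabulary) matrix fits comfortably in memory.
dense = tfidf.toarray()

# fit_predict clusters the dense rows and returns one label per document.
labels = MeanShift().fit_predict(dense)
for label, doc in sorted(zip(labels, documents)):
    print(label, doc)
```

The pipeline version is still handy when you want to reuse the fitted steps as a single object, but for a quick experiment the direct conversion is shorter.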

You can tune the hyperparameters, but I think this gives you the general idea.
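As for which hyperparameter to tune, the main one for MeanShift is bandwidth. The sketch below compares the default (where scikit-learn estimates the bandwidth itself) with a value from estimate_bandwidth; quantile=0.5 is an arbitrary value I chose for illustration, not a recommendation, and the number of clusters you get will depend on your data.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ['this is document one',
             'this is document two',
             'document one is fun',
             'document two is mean',
             'document is really short',
             'how fun is document one?',
             'mean shift... what is that']

# Dense tf-idf matrix, one row per document.
X = TfidfVectorizer().fit_transform(documents).toarray()

# estimate_bandwidth derives a kernel width from nearest-neighbour
# distances; a larger quantile generally gives a wider kernel.
bandwidth = estimate_bandwidth(X, quantile=0.5)

# bandwidth=None lets MeanShift estimate the width on its own.
for name, bw in [('default', None), ('quantile=0.5', bandwidth)]:
    model = MeanShift(bandwidth=bw).fit(X)
    n_clusters = len(set(model.labels_))
    print(name, n_clusters, model.labels_)
```

A wider bandwidth tends to merge nearby documents into fewer clusters, so it is worth trying a few values and checking whether the groupings make sense.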
