Clustering using feature hashing

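I have to cluster some documents that are in JSON format. I would like to experiment with feature hashing to reduce the dimensionality. To start small, here is my input: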

doc_a = { "category": "election, law, politics, civil, government",
          "expertise": "political science, civics, republican"
        }
doc_b = { "category": "Computers, optimization",
          "expertise": "computer science, graphs, optimization"
        }
doc_c = { "category": "Election, voting",
          "expertise": "political science, republican"
        }
doc_d = { "category": "Engineering, Software, computers",
          "expertise": "computers, programming, optimization"
        }
doc_e = { "category": "International trade, politics",
          "expertise": "civics, political activist"
        }

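Now, how can I use feature hashing to create a vector for each document, then compute similarity and create clusters? I am a bit lost after reading http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html. I am not sure whether I have to use "dict", or convert my data to contain integers and then pass "pair" as the "input_type" to my FeatureHasher. How should I interpret the output of the feature hasher? For example, the example at http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html outputs a numpy array.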

In [1]: from sklearn.feature_extraction import FeatureHasher
In [2]: hasher = FeatureHasher(n_features=10, non_negative=True, input_type='pair')
In [3]: x_new = hasher.fit_transform([[('a', 1), ('b', 2)], [('a', 0), ('c', 5)]])
In [4]: x_new.toarray()
Out[4]:
array([[ 1.,  2.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  5.,  0.,  0.]])
In [5]:

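I assume the rows are documents, and the column values are...? Say I want to cluster these vectors or find the similarity between them (using cosine or Jaccard): I am not sure whether I have to do an item-by-item comparison.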

Expected output: doc_a, doc_c and doc_e should be in one cluster, and the rest in another.

Thanks!

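You will make things easier for yourself if you use HashingVectorizer instead of FeatureHasher for this problem. HashingVectorizer takes care of tokenizing your input data and can accept a list of strings.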
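
As a rough sketch of what that looks like: the two input strings below are taken from the question, and n_features=2**10 is an arbitrary value chosen here only to keep the matrix small (the default is 2**20). Each row of the result is one document, and each column is a hash bucket rather than a named feature:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Two "expertise" strings from the question.
texts = [
    "political science, civics, republican",
    "computer science, graphs, optimization",
]

# n_features is an illustrative choice, not a recommendation.
vectorizer = HashingVectorizer(n_features=2**10)

# The vectorizer is stateless, so transform() works without fitting.
X = vectorizer.transform(texts)

print(X.shape)  # one row per document, one column per hash bucket
print(X.nnz)    # number of stored entries (at most one per distinct token per row)
```

Because the columns are hash buckets, a column cannot be mapped back to the word that produced it. The rows are meant for comparing documents with each other, not for inspection.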

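The main challenge with this problem is that you actually have two kinds of text features, category and expertise. The trick in this case is to apply a hashing vectorizer to each of the two features and then combine the output: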

from sklearn.feature_extraction.text import HashingVectorizer
from scipy.sparse import hstack
from sklearn.cluster import KMeans
docs = [doc_a,doc_b, doc_c, doc_d, doc_e]
# vectorize both fields separately
category_vectorizer = HashingVectorizer()
Xc = category_vectorizer.fit_transform([doc["category"] for doc in docs])
expertise_vectorizer = HashingVectorizer()
Xe = expertise_vectorizer.fit_transform([doc["expertise"] for doc in docs])
# combine the features into a single data set
X = hstack((Xc,Xe))
print("X: %d x %d" % X.shape)
print("Xc: %d x %d" % Xc.shape)
print("Xe: %d x %d" % Xe.shape)
# fit a cluster model
# fixed seed so the cluster labels are reproducible between runs
km = KMeans(n_clusters=2, n_init=10, random_state=0)
# predict the cluster
for k,v in zip(["a","b","c","d", "e"], km.fit_predict(X)):
    print("%s is in cluster %d" % (k,v))
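
To look at the similarities directly rather than only the cluster labels, one option is cosine similarity on the combined matrix. This is a minimal sketch, assuming the same five documents as in the question (repeated here so the snippet runs on its own). cosine_similarity from sklearn.metrics.pairwise compares whole rows, so no item-by-item comparison is needed:

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# The five documents from the question, keyed by a short name.
docs = {
    "a": {"category": "election, law, politics, civil, government",
          "expertise": "political science, civics, republican"},
    "b": {"category": "Computers, optimization",
          "expertise": "computer science, graphs, optimization"},
    "c": {"category": "Election, voting",
          "expertise": "political science, republican"},
    "d": {"category": "Engineering, Software, computers",
          "expertise": "computers, programming, optimization"},
    "e": {"category": "International trade, politics",
          "expertise": "civics, political activist"},
}
names = list(docs)

# Hash each field separately, then place the two blocks side by side.
vectorizer = HashingVectorizer()
Xc = vectorizer.transform([docs[n]["category"] for n in names])
Xe = vectorizer.transform([docs[n]["expertise"] for n in names])
X = hstack((Xc, Xe)).tocsr()

# sim[i, j] is the cosine similarity between document i and document j.
sim = cosine_similarity(X)

print(names)
print(np.round(sim, 2))
```

Documents that share tokens in either field get a higher score, which is the same signal KMeans relies on when it groups the rows.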
