Suppose I have some text sentences that I want to cluster using k-means.
sentences = [
"fix grammatical or spelling errors",
"clarify meaning without changing it",
"correct minor mistakes",
"add related resources or links",
"always respect the original author"
]
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(sentences)
num_clusters = 2
km = KMeans(n_clusters=num_clusters, init='random', n_init=1, verbose=1)
km.fit(X)
Now I can predict which cluster a new piece of text belongs to:
new_text = "hello world"
vec = vectorizer.transform([new_text])
print(km.predict(vec)[0])
However, suppose I apply PCA to reduce the 10,000 features down to 50.
from sklearn.decomposition import RandomizedPCA
pca = RandomizedPCA(n_components=50, whiten=True)
X2 = pca.fit_transform(X)
km.fit(X2)
Now I can no longer do the same thing to predict the cluster for a new text, because the vectorizer's output no longer matches what the model expects:
new_text = "hello world"
vec = vectorizer.transform([new_text])  # still 10000 features, not 50
print(km.predict(vec)[0])
ValueError: Incorrect number of features. Got 10000 features, expected 50
So how do I transform my new text into the lower-dimensional feature space?
You want to use pca.transform on the new data before feeding it to the model. This performs dimensionality reduction using the same PCA model that was fitted when you ran pca.fit_transform on the original data. You can then use your fitted model to predict on this reduced data.
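In isolation, the fit_transform/transform distinction can be sketched like this (a minimal example with synthetic dense data and hypothetical shapes):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_train = rng.rand(20, 10)  # 20 samples, 10 features
X_new = rng.rand(3, 10)     # 3 new samples, same 10 features

pca = PCA(n_components=5)
X_train_reduced = pca.fit_transform(X_train)  # learns the components AND projects
X_new_reduced = pca.transform(X_new)          # reuses the already-learned components

print(X_train_reduced.shape, X_new_reduced.shape)  # (20, 5) (3, 5)
```

The key point is that `transform` never re-fits: the new samples are projected with exactly the components learned from the training data.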
Basically, think of it as fitting one large model, made up of three smaller models stacked on top of each other. First you have a CountVectorizer model that determines how to process the data. Then you run a RandomizedPCA model that performs dimensionality reduction. Finally you run a KMeans model for clustering. When you fit the models, you go down the stack and fit each one. And when you want to make a prediction, you also have to go down the stack and apply each one.
# Initialize models
vectorizer = CountVectorizer(min_df=1)
pca = RandomizedPCA(n_components=50, whiten=True)
km = KMeans(n_clusters=2, init='random', n_init=1, verbose=1)
# Fit models
X = vectorizer.fit_transform(sentences)
X2 = pca.fit_transform(X)
km.fit(X2)
# Predict with models
X_new = vectorizer.transform(["hello world"])
X2_new = pca.transform(X_new)
km.predict(X2_new)
Using a Pipeline:
>>> from sklearn.cluster import KMeans
>>> from sklearn.decomposition import RandomizedPCA
>>> from sklearn.decomposition import TruncatedSVD
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.pipeline import make_pipeline
>>> sentences = [
... "fix grammatical or spelling errors",
... "clarify meaning without changing it",
... "correct minor mistakes",
... "add related resources or links",
... "always respect the original author"
... ]
>>> vectorizer = CountVectorizer(min_df=1)
>>> svd = TruncatedSVD(n_components=5)
>>> km = KMeans(n_clusters=2, init='random', n_init=1)
>>> pipe = make_pipeline(vectorizer, svd, km)
>>> pipe.fit(sentences)
Pipeline(steps=[('countvectorizer', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,...n_init=1,
n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
verbose=1))])
>>> pipe.predict(["hello, world"])
array([0], dtype=int32)
(TruncatedSVD is shown here because RandomizedPCA will stop working on term-frequency matrices in an upcoming release; it actually performed an SVD, not full PCA, anyway.)
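To connect the two answers: a Pipeline's predict is just the manual transform-down-the-stack shown earlier. A minimal sketch on current scikit-learn (where RandomizedPCA has been removed, so TruncatedSVD stands in; the `random_state` and `n_init` values are arbitrary choices for reproducibility):

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

sentences = [
    "fix grammatical or spelling errors",
    "clarify meaning without changing it",
    "correct minor mistakes",
    "add related resources or links",
    "always respect the original author",
]

pipe = make_pipeline(
    CountVectorizer(min_df=1),
    TruncatedSVD(n_components=2),  # works directly on the sparse count matrix
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
pipe.fit(sentences)

# pipe.predict is equivalent to applying each fitted step in order:
vec = pipe.named_steps['countvectorizer']
svd = pipe.named_steps['truncatedsvd']
km = pipe.named_steps['kmeans']
manual = km.predict(svd.transform(vec.transform(["hello world"])))
assert (manual == pipe.predict(["hello world"])).all()
```

Note that TruncatedSVD accepts the sparse matrix produced by CountVectorizer directly, which is exactly why it is the recommended replacement for RandomizedPCA on text data.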