如何使用Scikit从K-Means聚类中提取热门词



我目前正在对tf idf矢量化的文本数据(营销活动描述(使用K-means聚类,并有一个肘部知情的可选K,使用PCA绘制了散点图,并在我的数据框架中添加了一个带有聚类标签的列(全部使用python(。因此,从某种意义上说,我可以通过查看标记的文本数据来解释我的聚类模型。

然而,我也希望能够从每个聚类中提取N个最频繁的单词。

首先,我正在读取数据,并通过肘部获得最佳k:

# import pandas to use dataframes and handle tabular data, e.g the labeled text dataset for clustering
import pandas as pd
# read in the data using panda's "read_csv" function
col_list = ["DOC_ID", "TEXT", "CODE"]
data = pd.read_csv('/Users/williammarcellino/Downloads/AEMO_Sample.csv', usecols=col_list, encoding='latin-1')
# use regular expression to clean annoying "/n" newline characters
data = data.replace(r'n',' ', regex=True) 
#import sklearn for TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# vectorize text in the df and fit the TEXT data. Builds a vocabulary (a python dict) to map most frequent words 
# to features indices and compute word occurrence frequency (sparse matrix). Word frequencies are then reweighted 
# using the Inverse Document Frequency (IDF) vector collected feature-wise over the corpus.
vectorizer = TfidfVectorizer(stop_words={'english'})
X = vectorizer.fit_transform(data.TEXT)
#use elbow method to determine optimal "K"
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
Sum_of_squared_distances = []
# we'll try a range of K values, use sum of squared means on new observations to deteremine new centriods (clusters) or not
K = range(6,16)
for k in K:
km = KMeans(n_clusters=k, max_iter=200, n_init=10)
km = km.fit(X)
Sum_of_squared_distances.append(km.inertia_)
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()

在此基础上,我构建了一个k=9:的模型

# optimal "K" value from elobow plot above
true_k = 9
# define an unsupervised clustering "model" using KMeans
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=300, n_init=10)
#fit model to data
model.fit(X)
# define clusters lables (which are integers--a human needs to make them interpretable)
labels=model.labels_
title=[data.DOC_ID]
#make a "clustered" version of the dataframe
data_cl=data
# add label values as a new column, "Cluster"
data_cl['Cluster'] = labels
# I used this to look at my output on a small sample; remove for large datasets in actual analyses
print(data_cl)
# output our new, clustered dataframe to a csv file
data_cl.to_csv('/Users/me/Downloads/AEMO_Sample_clustered.csv')

最后,我绘制了主要组件:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

model_indices = model.fit_predict(X)
pca = PCA(n_components=2)
scatter_plot_points = pca.fit_transform(X.toarray())
colors = ["r", "b", "c", "y", "m", "paleturquoise", "g", 'aquamarine', 'tab:orange']
x_axis = [o[0] for o in scatter_plot_points]
y_axis = [o[1] for o in scatter_plot_points]
fig, ax = plt.subplots(figsize=(20,10))

ax.scatter(x_axis, y_axis, c=[colors[d] for d in model_indices])

for i, txt in enumerate(labels):
ax.annotate(txt, (x_axis[i]+.005, y_axis[i]), size=10)

任何从每个聚类中提取和绘制顶端项的帮助都将是一个很大的帮助。谢谢

我使用这里的代码回答了我的问题。

def get_top_features_cluster(tf_idf_array, prediction, n_feats):
prediction = km.predict(scatter_plot_points)
labels = np.unique(prediction)
dfs = []
for label in labels:
id_temp = np.where(prediction==label) # indices for each cluster
x_means = np.mean(tf_idf_array[id_temp], axis = 0) # returns average score across cluster
sorted_means = np.argsort(x_means)[::-1][:n_feats] # indices with top 20 scores
features = tf_idf_vectorizor.get_feature_names()
best_features = [(features[i], x_means[i]) for i in sorted_means]
df = pd.DataFrame(best_features, columns = ['features', 'score'])
dfs.append(df)
return dfs
dfs = get_top_features_cluster(tf_idf_array, prediction, 15)

这段代码对我不起作用,所以我做了如下操作:

vectorizer = TfidfVectorizer(stop_words=stopwords)
X = vectorizer.fit_transform(dfi['text'][~dfi['text'].isna()])
print('How many clusters do you want to use?')
true_k = int(input())
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=200, n_init=10)
model.fit(X)
labels=model.labels_
clusters=pd.DataFrame(list(zip(dfi['text'][~dfi['text'].isna()],labels)),columns=['title','cluster'])
features = vectorizer.get_feature_names()
n_feats=15
for i in range(true_k):
cclust=X[clusters['cluster'] == i]
meanWts=cclust.A.mean(axis=0)
sorted_mean_ix = np.argsort(meanWts)[::-1][:n_feats] # indices with top 15 scores
#get most important feature names:
print(np.array(features)[sorted_mean_ix])

最新更新