Topic modeling - running LDA in sklearn: how to compute a word cloud for each topic



I trained an LDA model in sklearn to build a topic model, but I don't know how to compute a keyword word cloud for each of the resulting topics.

Here is my LDA model:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

vectorizer = CountVectorizer(analyzer='word',
                             min_df=3,
                             max_df=6000,
                             stop_words='english',
                             lowercase=False,
                             token_pattern='[a-zA-Z0-9]{3,}',
                             max_features=50000,
                             )
data_vectorized = vectorizer.fit_transform(data_lemmatized)  # data_lemmatized is all my processed document text
best_lda_model = LatentDirichletAllocation(batch_size=128, doc_topic_prior=0.1,
                                           evaluate_every=-1, learning_decay=0.7,
                                           learning_method='online', learning_offset=10.0,
                                           max_doc_update_iter=100, max_iter=10,
                                           mean_change_tol=0.001, n_components=10, n_jobs=None,
                                           perp_tol=0.1, random_state=None, topic_word_prior=0.1,
                                           total_samples=1000000.0, verbose=0)
# fit before transforming; calling transform on an unfitted model raises an error
lda_output = best_lda_model.fit_transform(data_vectorized)

I know that best_lda_model.components_ gives the topic-word weights... and vectorizer.get_feature_names() gives all the words in the vocabulary... but I don't know how to get from there to a word cloud per topic.
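
For reference, these are the shapes involved (a quick sanity check; the exact sizes assume the 10 topics and up-to-50,000-word vocabulary configured above):

print(best_lda_model.components_.shape)     # (n_components, n_features), here (10, vocab_size)
print(len(vectorizer.get_feature_names()))  # vocab_size, capped at 50000 by max_features
print(lda_output.shape)                     # (n_documents, n_components)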

Thanks a lot!

You have to iterate over the model's components_, whose shape is [n_components, n_features]: the first dimension indexes the topics, and the second holds a score for every word in the vocabulary. So you first need to find the indices of the words most relevant to each topic, and then use the vocab list (built with get_feature_names()) to retrieve those words.

import numpy as np

# vocabulary, aligned with the columns of components_
# (on scikit-learn >= 1.2 use vectorizer.get_feature_names_out() instead)
vocab = vectorizer.get_feature_names()

# dictionary to store the top words for each topic, and how many words to retrieve per topic
words = {}
n_top_words = 10

for topic, component in enumerate(best_lda_model.components_):
    # [::-1] sorts the word scores in descending order
    indices = np.argsort(component)[::-1][:n_top_words]
    # store the words most relevant to the topic
    words[topic] = [vocab[i] for i in indices]
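
From there, to actually draw one word cloud per topic: a minimal sketch, assuming the third-party wordcloud package and matplotlib are installed. Its generate_from_frequencies method accepts a word-to-weight dictionary, so the topic's row of components_ can be fed in directly as pseudo-frequencies:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

for topic, component in enumerate(best_lda_model.components_):
    # use the topic's word weights as pseudo-frequencies for the cloud
    top_indices = np.argsort(component)[::-1][:50]  # 50 words per cloud is an arbitrary choice
    freqs = {vocab[i]: component[i] for i in top_indices}
    wc = WordCloud(background_color='white').generate_from_frequencies(freqs)
    plt.figure()
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title('Topic {}'.format(topic))
plt.show()

Using the raw topic-word weights as frequencies makes each cloud reflect the relative importance of the words within that topic, rather than their raw counts in the corpus.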
