Topic-to-document distribution in Gensim LDA

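Is there a way in Python to map the documents that belong to a certain topic? For example, a list of documents for "Topic 0". I know there are many ways to list the topics for each document, but how do I do it the other way around?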

Edit:

I am using the following script for LDA:

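import os

import gensim
import textract
from gensim import corpora
from nltk.corpus import stopwords  # assumed source of `stopwords`; not shown in the original snippet

# files, my_path, tokenizer and p_stemmer are defined elsewhere in the script

# extract the raw text of every file
doc_set = []
for file in files:
    newpath = os.path.join(my_path, file)
    newpath1 = textract.process(newpath)
    newpath2 = newpath1.decode("utf-8")
    doc_set.append(newpath2)

# tokenize, drop stop words, and stem each document
stop_words = set(stopwords.words())
texts = []
for i in doc_set:
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)
    stopped_tokens = [t for t in tokens if t not in stop_words]
    stemmed_tokens = [p_stemmer.stem(t) for t in stopped_tokens]
    texts.append(stemmed_tokens)

# build the bag-of-words corpus and train the model
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, random_state=0, id2word=dictionary, passes=1)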

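You have a tool/API (Gensim LDA) that, when given a document, gives you a list of topics.

But you want the reverse: a list of documents for a topic.

Essentially, you'll want to build the reverse mapping yourself.

Fortunately, Python's native dicts and idioms for working with mappings make this quite simple, just a few lines of code, as long as you're working with data that fits entirely in memory.

Very roughly, the approach would be:

create a new structure (dict or list) for mapping topics to lists of documents
iterate over all documents, adding them (perhaps with scores) to that topic-to-documents mapping
finally, look up (and perhaps sort) those lists of documents for each topic of interest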
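As a minimal sketch of those three steps, independent of any particular library: the function name `invert_doc_topics` and the toy scores below are made up for illustration, and the input is assumed to be a dict from document ID to a list of (topic_id, score) pairs.

```python
from collections import defaultdict


def invert_doc_topics(topics_by_doc):
    """Turn {doc_id: [(topic_id, score), ...]} into {topic_id: [(doc_id, score), ...]}."""
    docs_by_topic = defaultdict(list)
    # Walk every document and file it under each of its topics.
    for doc_id, topic_scores in topics_by_doc.items():
        for topic_id, score in topic_scores:
            docs_by_topic[topic_id].append((doc_id, score))
    # Put the highest-scoring documents first within each topic.
    for doc_list in docs_by_topic.values():
        doc_list.sort(key=lambda pair: pair[1], reverse=True)
    return dict(docs_by_topic)


# Hypothetical per-document topic scores (made-up numbers, not real model output).
toy_topics_by_doc = {
    "doc_a": [(0, 0.9), (1, 0.1)],
    "doc_b": [(0, 0.3), (1, 0.7)],
    "doc_c": [(1, 0.95)],
}
docs_by_topic = invert_doc_topics(toy_topics_by_doc)
print(docs_by_topic[0])
```

A `defaultdict(list)` avoids having to know the number of topics up front; a plain list of lists works just as well when topic IDs are dense integers.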

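If your question could be edited to include more information about the format/IDs of your documents and topics, and how you've trained your LDA model, this answer could be expanded with more specific example code to build the kind of reverse mapping you'd need.

Update for your code update:

OK, if your model is in ldamodel and your BOW-formatted documents are in corpus, you'd do something like: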

# setup: create an empty list per topic to collect the docs
# (ldamodel.num_topics covers every topic, whereas print_topics()
# only returns a limited number of topics by default)
docs_per_topic = [[] for _ in range(ldamodel.num_topics)]
# now, for every doc...
for doc_id, doc_bow in enumerate(corpus):
    # ...get its topics...
    doc_topics = ldamodel.get_document_topics(doc_bow)
    # ...& for each of its topics...
    for topic_id, score in doc_topics:
        # ...add the doc_id & its score to the topic's doc list
        docs_per_topic[topic_id].append((doc_id, score))

After this, you can see the list of all (doc_id, score) values for a certain topic like this (for topic 0):

print(docs_per_topic[0])
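If you'd rather list each document only under its single strongest topic, a variant along these lines might do. This is a sketch on made-up data: `group_by_dominant_topic` is a hypothetical helper, and `toy_doc_topics` stands in for the per-document lists of (topic_id, score) pairs that `get_document_topics` would return for each document.

```python
def group_by_dominant_topic(topics_by_doc, num_topics):
    """Assign each doc (by position) to the one topic with its highest score."""
    docs_per_topic = [[] for _ in range(num_topics)]
    for doc_id, topic_scores in enumerate(topics_by_doc):
        if not topic_scores:
            continue  # no topic passed the model's probability cutoff
        best_topic, best_score = max(topic_scores, key=lambda pair: pair[1])
        docs_per_topic[best_topic].append((doc_id, best_score))
    return docs_per_topic


# Hypothetical topic scores for three documents (made-up numbers).
toy_doc_topics = [
    [(0, 0.8), (1, 0.2)],
    [(0, 0.4), (1, 0.6)],
    [(0, 0.55), (1, 0.45)],
]
dominant = group_by_dominant_topic(toy_doc_topics, num_topics=2)
print(dominant[0])
```

With this variant every document appears in at most one list, so the per-topic lists partition the corpus instead of overlapping.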

If you're interested in the top docs per topic, you can further sort each list's pairs by their score:

for doc_list in docs_per_topic:
    doc_list.sort(key=lambda id_and_score: id_and_score[1], reverse=True)

Then, you could get the top 10 docs for topic 0 like this:

print(docs_per_topic[0][:10])
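If you only ever need the top few documents and the lists are long, `heapq.nlargest` from the standard library can pick them out without sorting each whole list. A small sketch, where `top_docs` is a hypothetical helper and the pairs are made-up (doc_id, score) values:

```python
import heapq


def top_docs(doc_list, n=10):
    """Return the n highest-scoring (doc_id, score) pairs, best first."""
    return heapq.nlargest(n, doc_list, key=lambda pair: pair[1])


# Hypothetical (doc_id, score) pairs for one topic (made-up numbers).
toy_topic_docs = [(0, 0.12), (1, 0.87), (2, 0.45), (3, 0.91), (4, 0.05)]
top_two = top_docs(toy_topic_docs, n=2)
print(top_two)
```

Unlike `list.sort`, this leaves the original list untouched, which can matter if you also want to keep the documents in corpus order.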

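Note that this all uses in-memory lists, which might become impractical for very large corpora. In some cases, you might need to compile the per-topic listings into disk-backed structures, like files or a database.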
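One possible disk-backed option is the standard-library `sqlite3` module. The sketch below is only an illustration under assumptions: the table name `doc_topic`, the helper functions, and the toy scores are all made up, and `":memory:"` is used so the example runs anywhere (you'd pass a file path instead to actually keep the data on disk).

```python
import sqlite3


def store_doc_topics(conn, doc_topics_iter):
    """Write (doc_id, topic_id, score) rows from an iterable of (doc_id, [(topic_id, score), ...])."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS doc_topic (doc_id INTEGER, topic_id INTEGER, score REAL)"
    )
    # The index lets per-topic lookups avoid scanning the whole table.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_topic_score ON doc_topic (topic_id, score)"
    )
    with conn:  # commit all inserts as one transaction
        conn.executemany(
            "INSERT INTO doc_topic VALUES (?, ?, ?)",
            (
                (doc_id, topic_id, score)
                for doc_id, topic_scores in doc_topics_iter
                for topic_id, score in topic_scores
            ),
        )


def top_docs_for_topic(conn, topic_id, n=10):
    """Return up to n (doc_id, score) rows for one topic, best score first."""
    rows = conn.execute(
        "SELECT doc_id, score FROM doc_topic WHERE topic_id = ? ORDER BY score DESC LIMIT ?",
        (topic_id, n),
    )
    return rows.fetchall()


# Hypothetical topic scores for three documents (made-up numbers).
toy_doc_topics = [[(0, 0.8), (1, 0.2)], [(0, 0.4), (1, 0.6)], [(1, 0.9)]]
conn = sqlite3.connect(":memory:")  # use a file path for real on-disk storage
store_doc_topics(conn, enumerate(toy_doc_topics))
top_for_topic_1 = top_docs_for_topic(conn, 1, n=2)
print(top_for_topic_1)
```

Because the rows are fed in through a generator, the full topic-to-documents mapping never has to exist in memory at once; with a real model you would stream the per-document topic lists in the same way.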
