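I am using gensim LDA topic modeling to extract topics from a corpus. Now I want to get the top 20 documents that represent each topic, meaning the documents with the highest probability for that topic. I want to save them in a CSV file with 4 columns: topic ID, topic words, the probability of each word in the topic, and the top 20 documents for each topic.
I have tried get_document_topics, which I think is the best approach for this task:
all_topics = lda_model.get_document_topics(corpus, minimum_probability=0.0, per_word_topics=False)
But I am not sure how to get the top 20 documents that best represent each topic and add them to the CSV file.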
import csv
import os
from pprint import pprint
import gensim
from gensim import corpora
# remove_stopwords and processed_docs are defined elsewhere in my code
data_words_nostops = remove_stopwords(processed_docs)
# Create Dictionary
id2word = corpora.Dictionary(data_words_nostops)
# Create Corpus
texts = data_words_nostops
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=20,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)
pprint(lda_model.print_topics())
# save csv
fn = "topic_terms5.csv"
if os.path.isfile(fn):
    m = "a"
else:
    m = "w"
num_topics = 20
# save topic, term, prob data in the file
with open(fn, m, encoding="utf8", newline='') as csvfile:
    fieldnames = ["topic_id", "term", "prob"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    if m == "w":
        writer.writeheader()
    for topic_id in range(num_topics):
        term_probs = lda_model.show_topic(topic_id, topn=6)
        for term, prob in term_probs:
            row = {}
            row['topic_id'] = topic_id
            row['prob'] = prob
            row['term'] = term
            writer.writerow(row)
Expected result: a CSV file with 4 columns: topic ID, topic words, the probability of each word, and the top 20 documents for each topic.
First, each document has a topic vector, which is a list of tuples like this:
[(0, 3.0161273e-05), (1, 3.0161273e-05), (2, 3.0161273e-05), (3, 3.0161273e-05),
(4, 3.0161273e-05), (5, 0.06556476), (6, 0.14744747), (7, 3.0161273e-05),
(8, 3.0161273e-05), (9, 3.0161273e-05), (10, 3.0161273e-05), (11, 0.011416071),
(12, 3.0161273e-05), (13, 3.0161273e-05), (14, 3.0161273e-05), (15, 0.057074558),
(16, 3.0161273e-05), (17, 3.0161273e-05), (18, 3.0161273e-05), (19, 3.0161273e-05),
(20, 0.7178939), (21, 3.0161273e-05), (22, 3.0161273e-05), (23, 3.0161273e-05),
(24, 3.0161273e-05)]
For example, in (0, 3.0161273e-05), 0 is the topic ID and 3.0161273e-05 is the probability.
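To make the structure concrete, here is a minimal sketch that reads one such vector. The values are a shortened copy of the example vector, and the names prob_by_topic and dominant_topic are made up for illustration; they are not part of gensim.

```python
# Toy topic vector: a shortened copy of the example vector.
topic_vector = [(5, 0.06556476), (6, 0.14744747), (11, 0.011416071),
                (15, 0.057074558), (20, 0.7178939)]

# Turn the list of (topic ID, probability) tuples into a lookup table.
prob_by_topic = dict(topic_vector)

# The dominant topic is the one with the highest probability.
dominant_topic = max(prob_by_topic, key=prob_by_topic.get)
print(dominant_topic, prob_by_topic[dominant_topic])
```

This only tells you the strongest topic for a single document. It does not yet tell you which documents are strongest for a given topic.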
You need to rearrange this data structure into a form that lets you compare a topic's probability across documents.
Here is what you can do:
# Create a dictionary with the topic ID as the key; each value is a list of
# (docID, probability of this particular topic for the doc) tuples.
topic_dict = {i: [] for i in range(20)}  # Assuming you have 20 topics.
# Loop over all the documents to group the probability of each topic.
for docID, bow in enumerate(corpus):
    # get_document_topics returns a plain list of (topicID, prob) tuples.
    # lda_model[bow] would return a 3-tuple here, because the model was
    # built with per_word_topics=True.
    topic_vector = lda_model.get_document_topics(bow, minimum_probability=0.0)
    for topicID, prob in topic_vector:
        topic_dict[topicID].append((docID, prob))
# Then sort each topic's list to find its top 20 documents.
top_docs = {}
for topicID, probs in topic_dict.items():
    doc_probs = sorted(probs, key=lambda x: x[1], reverse=True)
    top_docs[topicID] = [dp[0] for dp in doc_probs[:20]]
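This gives you the top 20 documents for each topic. You can collect them in a list (which will be a list of lists) or in a dictionary so that you can output them.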
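As a rough sketch of the last step, the ranking and the CSV output could be combined along these lines. It runs on small hand-made data instead of a real model. The helper names top_docs_per_topic and write_topic_rows, the top_docs column name, and the choice to store document IDs as one space-separated string are all assumptions made for illustration. With a real model, doc_topics would come from get_document_topics and topic_terms from show_topic.

```python
import csv
import heapq
import io
from collections import defaultdict


def top_docs_per_topic(doc_topics, n=20):
    """Map each topic ID to the IDs of its n highest-probability documents."""
    by_topic = defaultdict(list)
    for doc_id, topic_vector in enumerate(doc_topics):
        for topic_id, prob in topic_vector:
            by_topic[topic_id].append((prob, doc_id))
    return {
        topic_id: [doc_id for _, doc_id in
                   heapq.nlargest(n, pairs, key=lambda p: p[0])]
        for topic_id, pairs in by_topic.items()
    }


def write_topic_rows(out, topic_terms, top_docs):
    """Write one row per (topic, term), repeating the topic's top documents."""
    writer = csv.writer(out)
    writer.writerow(["topic_id", "term", "prob", "top_docs"])
    for topic_id in sorted(topic_terms):
        docs = " ".join(str(d) for d in top_docs.get(topic_id, []))
        for term, prob in topic_terms[topic_id]:
            writer.writerow([topic_id, term, prob, docs])


# Hand-made stand-ins for the model output (hypothetical values).
doc_topics = [
    [(0, 0.9), (1, 0.1)],
    [(0, 0.2), (1, 0.8)],
    [(0, 0.6), (1, 0.4)],
]
topic_terms = {0: [("cat", 0.5), ("dog", 0.3)], 1: [("car", 0.7)]}

top_docs = top_docs_per_topic(doc_topics, n=2)
buffer = io.StringIO()
write_topic_rows(buffer, topic_terms, top_docs)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk, pass a file opened with newline='' in place of the StringIO buffer. Repeating the document list on every term row keeps the file at 4 flat columns; writing one row per topic with the terms joined into a single cell would work equally well.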