How to convert LDA topics into a list of the top 20 words per topic in Python



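I am currently working with the LDA algorithm in Python. I want to convert the topics into a list of the top 20 words in each topic. I tried the code below, but got different output. I want my output in the following format: topic=2, words=20.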

['men', 'kill', 'soldier', 'order', 'patient', 'night', 'priest', 'becom', 'new', 'speech', 'friend', 'decid', 'young', 'ward', 'state', 'front', 'would', 'home', 'two', 'father']
["n't", 'go', 'fight', 'doe', 'home', 'famili', 'car', 'night', 'say', 'next', 'ask', 'day', 'want', 'show', 'goe', 'friend', 'two', 'polic', 'name', 'meet']

Instead, I got the following output:

["(u'ngma', 0.034841332255132154)", "(u'video', 0.0073756817356584745)", "(u'youtube', 0.006524039676605746)", "(u'liked', 0.0065240394176856644)",]
["(u'ngma', 0.024537057880333127)", "(u'photography', 0.0068263432438681482)", "(u'tvallwhite', 0.0029535361359022566)", "(u'3', 0.0029252727655122079)"]

My code:

`ldamodel = Lda(doc_term_matrix, num_topics=2, id2word = dictionary,passes=50)
lda=ldamodel.print_topics(num_topics=2, num_words=3)
f=open('LDA.txt','w')
f.write(str(lda))
f.close()
topics_matrix = ldamodel.show_topics(formatted=False,num_words=10)
topics_matrix = np.array((topics_matrix),dtype=list)
topic_words = topics_matrix[:, 1]
for i in topic_words:
    print([str(word) for word in i])
    print()`

Edit 1:

topic_words = []
for i in range(3):
    tt = ldamodel.get_topic_terms(i, 10)
    topic_words.append([pair[0] for pair in tt])
print(topic_words)

This gives unexpected output (word ids rather than words):

[[1897, 135, 130, 127, 70, 162, 445, 656, 608, 1019], [1897, 364, 56, 1236, 181, 172, 449, 48, 15, 18], [1897, 163, 11, 70, 166, 345, 480, 9, 60, 351]]

Try this:

from gensim import corpora
from gensim.models.ldamodel import LdaModel
from gensim.parsing.preprocessing import STOPWORDS

# example docs
doc1 = """
Java (Indonesian: Jawa; Javanese: ꦗꦮ; Sundanese: ᮏᮝ) is an island of Indonesia.
With a population of over 141 million (the island itself) or 145 million (the 
administrative region), Java is home to 56.7 percent of the Indonesian population 
and is the most populous island on Earth.[1] The Indonesian capital city, Jakarta, 
is located on western Java. Much of Indonesian history took place on Java. It was 
the center of powerful Hindu-Buddhist empires, the Islamic sultanates, and the core 
of the colonial Dutch East Indies. Java was also the center of the Indonesian struggle 
for independence during the 1930s and 1940s. Java dominates Indonesia politically, 
economically and culturally.
"""
doc2 = """
Hydrogen fuel is a zero-emission fuel when burned with oxygen, if one considers water 
not to be an emission. It often uses electrochemical cells, or combustion in internal 
engines, to power vehicles and electric devices. It is also used in the propulsion of 
spacecraft and might potentially be mass-produced and commercialized for passenger vehicles 
and aircraft.Hydrogen lies in the first group and first period in the periodic table, i.e. 
it is the first element on the periodic table, making it the lightest element. Since 
hydrogen gas is so light, it rises in the atmosphere and is therefore rarely found in 
its pure form, H2."""
doc3 = """
The giraffe (Giraffa) is a genus of African even-toed ungulate mammals, the tallest living 
terrestrial animals and the largest ruminants. The genus currently consists of one species, 
Giraffa camelopardalis, the type species. Seven other species are extinct, prehistoric 
species known from fossils. Taxonomic classifications of one to eight extant giraffe species
have been described, based upon research into the mitochondrial and nuclear DNA, as well 
as morphological measurements of Giraffa, but the IUCN currently recognizes only one 
species with nine subspecies.
"""
documents = [doc1, doc2, doc3]
# lowercase each document, split on whitespace, and drop stop words
document_wrd_splt = [[word for word in document.lower().split() if word not in STOPWORDS]
                     for document in documents]
dictionary = corpora.Dictionary(document_wrd_splt)
print(dictionary.token2id)
corpus = [dictionary.doc2bow(text) for text in document_wrd_splt]
lda = LdaModel(corpus, num_topics=3, id2word = dictionary, passes=50)
num_topics = 3
topic_words = []
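for i in range(num_topics):
    # get_topic_terms returns (word_id, probability) pairs for topic i
    tt = lda.get_topic_terms(i, 20)
    # map each word id back to its token through the dictionary
    topic_words.append([dictionary[pair[0]] for pair in tt])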
# output
>>> topic_words[0]
['indonesian', 'java', 'species', 'island', 'population', 'million', '(the', 'java.', 'center', 'giraffe', 'currently', 'genus', 'city,', 'economically', 'administrative', 'east', 'sundanese:', 'itself)', 'took', '1940s.']
>>> topic_words[1]
['vehicles', 'fuel', 'hydrogen', 'periodic', 'table,', 'i.e.', 'uses', 'form,', 'considers', 'zero-emission', 'internal', 'period', 'burned', 'cells,', 'rises', 'pure', 'atmosphere', 'aircraft.hydrogen', 'water', 'engines,']
>>> topic_words[2]
['giraffa,', 'even-toed', 'living', 'described,', 'camelopardalis,', 'consists', 'extinct,', 'seven', 'fossils.', 'morphological', 'terrestrial', '(giraffa)', 'dna,', 'mitochondrial', 'nuclear', 'ruminants.', 'classifications', 'species,', 'prehistoric', 'known']
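
As for why the code in the question printed strings like "(u'ngma', 0.0348...)": judging from that output, each entry returned by show_topics(formatted=False) is a (topic_id, [(word, probability), ...]) pair, so str(word) was applied to a whole (word, probability) tuple instead of to the word. Below is a minimal sketch of unpacking that structure. It assumes the shape just described, which may differ in other gensim versions. The helper name top_words and the sample data are hypothetical, with the values rounded from the output shown in the question, so the snippet runs without a trained model.

```python
def top_words(shown_topics):
    """Turn show_topics(formatted=False)-style output into plain word lists.

    Assumes each entry looks like (topic_id, [(word, probability), ...]).
    """
    return [[word for word, _prob in pairs] for _topic_id, pairs in shown_topics]


# Hand-written sample in the assumed shape (values rounded from the question's output)
sample = [
    (0, [("ngma", 0.0348), ("video", 0.0074), ("youtube", 0.0065), ("liked", 0.0065)]),
    (1, [("ngma", 0.0245), ("photography", 0.0068), ("tvallwhite", 0.0030), ("3", 0.0029)]),
]

for words in top_words(sample):
    print(words)
# ['ngma', 'video', 'youtube', 'liked']
# ['ngma', 'photography', 'tvallwhite', '3']
```

With a trained model, the same helper should work as top_words(ldamodel.show_topics(num_topics=2, num_words=20, formatted=False)), provided the return shape matches the assumption above.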
