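I am currently working with the LDA algorithm in Python. I want to convert the topics into lists of the top 20 words in each topic. I tried the code below, but got a different output. I want my output in the following format, for topic=2, words=20: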
['men', 'kill', 'soldier', 'order', 'patient', 'night', 'priest', 'becom', 'new', 'speech', 'friend', 'decid', 'young', 'ward', 'state', 'front', 'would', 'home', 'two', 'father']
["n't", 'go', 'fight', 'doe', 'home', 'famili', 'car', 'night', 'say', 'next', 'ask', 'day', 'want', 'show', 'goe', 'friend', 'two', 'polic', 'name', 'meet']
Instead, I get the following output:
["(u'ngma', 0.034841332255132154)", "(u'video', 0.0073756817356584745)", "(u'youtube', 0.006524039676605746)", "(u'liked', 0.0065240394176856644)",]
["(u'ngma', 0.024537057880333127)", "(u'photography', 0.0068263432438681482)", "(u'tvallwhite', 0.0029535361359022566)", "(u'3', 0.0029252727655122079)"]
My code:
`ldamodel = Lda(doc_term_matrix, num_topics=2, id2word=dictionary, passes=50)
lda = ldamodel.print_topics(num_topics=2, num_words=3)
f = open('LDA.txt', 'w')
f.write(str(lda))
f.close()
topics_matrix = ldamodel.show_topics(formatted=False, num_words=10)
topics_matrix = np.array((topics_matrix), dtype=list)
topic_words = topics_matrix[:, 1]
for i in topic_words:
    print([str(word) for word in i])
    print()`
Edit 1:
topic_words = []
for i in range(3):
    tt = ldamodel.get_topic_terms(i, 10)
    topic_words.append([pair[0] for pair in tt])
print topic_words
This produces unexpected output (word ids instead of words):
[[1897, 135, 130, 127, 70, 162, 445, 656, 608, 1019], [1897, 364, 56, 1236, 181, 172, 449, 48, 15, 18], [1897, 163, 11, 70, 166, 345, 480, 9, 60, 351]]
Try this:
from gensim import corpora
import gensim
from gensim.models.ldamodel import LdaModel
from gensim.parsing.preprocessing import STOPWORDS
# example docs
doc1 = """
Java (Indonesian: Jawa; Javanese: ꦗꦮ; Sundanese: ᮏᮝ) is an island of Indonesia.
With a population of over 141 million (the island itself) or 145 million (the
administrative region), Java is home to 56.7 percent of the Indonesian population
and is the most populous island on Earth.[1] The Indonesian capital city, Jakarta,
is located on western Java. Much of Indonesian history took place on Java. It was
the center of powerful Hindu-Buddhist empires, the Islamic sultanates, and the core
of the colonial Dutch East Indies. Java was also the center of the Indonesian struggle
for independence during the 1930s and 1940s. Java dominates Indonesia politically,
economically and culturally.
"""
doc2 = """
Hydrogen fuel is a zero-emission fuel when burned with oxygen, if one considers water
not to be an emission. It often uses electrochemical cells, or combustion in internal
engines, to power vehicles and electric devices. It is also used in the propulsion of
spacecraft and might potentially be mass-produced and commercialized for passenger vehicles
and aircraft.Hydrogen lies in the first group and first period in the periodic table, i.e.
it is the first element on the periodic table, making it the lightest element. Since
hydrogen gas is so light, it rises in the atmosphere and is therefore rarely found in
its pure form, H2."""
doc3 = """
The giraffe (Giraffa) is a genus of African even-toed ungulate mammals, the tallest living
terrestrial animals and the largest ruminants. The genus currently consists of one species,
Giraffa camelopardalis, the type species. Seven other species are extinct, prehistoric
species known from fossils. Taxonomic classifications of one to eight extant giraffe species
have been described, based upon research into the mitochondrial and nuclear DNA, as well
as morphological measurements of Giraffa, but the IUCN currently recognizes only one
species with nine subspecies.
"""
documents = [doc1, doc2, doc3]
# lowercase, split on whitespace, and drop stopwords
document_wrd_splt = [[word for word in document.lower().split() if word not in STOPWORDS]
                     for document in documents]
dictionary = corpora.Dictionary(document_wrd_splt)
print(dictionary.token2id)
corpus = [dictionary.doc2bow(text) for text in document_wrd_splt]
lda = LdaModel(corpus, num_topics=3, id2word = dictionary, passes=50)
num_topics = 3
topic_words = []
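for i in range(num_topics):
    # get_topic_terms returns (word_id, probability) pairs
    tt = lda.get_topic_terms(i, 20)
    # map each word id back to its token through the dictionary
    topic_words.append([dictionary[pair[0]] for pair in tt])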
# output
>>> topic_words[0]
['indonesian', 'java', 'species', 'island', 'population', 'million', '(the', 'java.', 'center', 'giraffe', 'currently', 'genus', 'city,', 'economically', 'administrative', 'east', 'sundanese:', 'itself)', 'took', '1940s.']
>>> topic_words[1]
['vehicles', 'fuel', 'hydrogen', 'periodic', 'table,', 'i.e.', 'uses', 'form,', 'considers', 'zero-emission', 'internal', 'period', 'burned', 'cells,', 'rises', 'pure', 'atmosphere', 'aircraft.hydrogen', 'water', 'engines,']
>>> topic_words[2]
['giraffa,', 'even-toed', 'living', 'described,', 'camelopardalis,', 'consists', 'extinct,', 'seven', 'fossils.', 'morphological', 'terrestrial', '(giraffa)', 'dna,', 'mitochondrial', 'nuclear', 'ruminants.', 'classifications', 'species,', 'prehistoric', 'known']
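
As a side note, the loop in the question can also be repaired directly. Judging from the output quoted there, each element of `i` is a `(word, probability)` pair, so `str(word)` turns the whole tuple into a string. Unpacking the pair and keeping only the word avoids that. The sketch below is not run against gensim: the `topics` list is a hypothetical stand-in for what `ldamodel.show_topics(formatted=False)` appears to return in the question (a list of `(topic_id, [(word, probability), ...])` entries), with probabilities truncated from the quoted output, and `top_words` is a helper name made up for illustration. Other gensim releases may return a different shape, so check your version first.

```python
# Hypothetical stand-in for ldamodel.show_topics(num_topics=2, num_words=4, formatted=False).
# Assumed shape: [(topic_id, [(word, probability), ...]), ...]
topics = [
    (0, [("ngma", 0.0348), ("video", 0.0074), ("youtube", 0.0065), ("liked", 0.0065)]),
    (1, [("ngma", 0.0245), ("photography", 0.0068), ("tvallwhite", 0.0030), ("3", 0.0029)]),
]


def top_words(topics):
    """Return one list of words per topic, dropping topic ids and probabilities."""
    return [[word for word, _prob in pairs] for _topic_id, pairs in topics]


topic_words = top_words(topics)
for words in topic_words:
    print(words)
```

With a real model you would pass the `show_topics` result in place of the hard-coded `topics` and ask for `num_words=20`. This also makes the `np.array(..., dtype=list)` conversion from the question unnecessary.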