Python Gensim LDA model with more than 20 topics does not print properly



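Using the Gensim package (both LDA and Mallet), I noticed that when I create a model with more than 20 topics and call the print_topics function, it prints at most 20 topics (note: not the first 20 topics, but an arbitrary 20), and they are out of order.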

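So my question is: how do I get all of the topics to print? I am not sure whether this is a bug or a mistake on my part. I went back through my library of LDA models (over 5000, from different data sources) and noticed that this happens in every model with more than 20 topics.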

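Below is sample code with its output. In the output you will see that the topics are not in order (they should be) and that some topics are missing, for example topic 3.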

import gensim
from pprint import pprint

# jr_dict_corpus and jr_dict are the bag-of-words corpus and dictionary built earlier
lda_model = gensim.models.ldamodel.LdaModel(corpus=jr_dict_corpus,
                                            id2word=jr_dict,
                                            num_topics=25,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)
pprint(lda_model.print_topics())
# Note: if the model contained 20 topics, the topics would be listed in order 0-19
[(21,
'0.001*"commitment" + 0.001*"study" + 0.001*"evolve" + 0.001*"outlook" + '
'0.001*"value" + 0.001*"people" + 0.001*"individual" + 0.001*"client" + '
'0.001*"structure" + 0.001*"proposal"'),
(18,
'0.001*"self" + 0.001*"insurance" + 0.001*"need" + 0.001*"trend" + '
'0.001*"statistic" + 0.001*"propose" + 0.001*"analysis" + 0.001*"perform" + '
'0.001*"impact" + 0.001*"awareness"'),
(2,
'0.001*"link" + 0.001*"task" + 0.001*"collegiate" + 0.001*"universitie" + '
'0.001*"banking" + 0.001*"origination" + 0.001*"security" + 0.001*"standard" '
'+ 0.001*"qualifications_bachelor" + 0.001*"greenfield"'),
(11,
'0.024*"collegiate" + 0.016*"interpersonal" + 0.016*"prepare" + '
'0.016*"invite" + 0.016*"aspect" + 0.016*"college" + 0.016*"statistic" + '
'0.016*"continent" + 0.016*"structure" + 0.016*"project"'),
(10,
'0.049*"enjoy" + 0.049*"ambiguity" + 0.017*"accordance" + 0.017*"liberalize" '
'+ 0.017*"developing" + 0.017*"application" + 0.017*"vacancie" + '
'0.017*"service" + 0.017*"initiative" + 0.017*"discontinuing"'),
(20,
'0.028*"negotiation" + 0.028*"desk" + 0.018*"enhance" + 0.018*"engage" + '
'0.018*"discussion" + 0.018*"ability" + 0.018*"depth" + 0.018*"derive" + '
'0.018*"enjoy" + 0.018*"balance"'),
(12,
'0.036*"individual" + 0.024*"validate" + 0.018*"greenfield" + '
'0.018*"capability" + 0.018*"coordinate" + 0.018*"create" + '
'0.018*"programming" + 0.018*"safety" + 0.010*"evaluation" + '
'0.002*"reliability"'),
(1,
'0.028*"negotiation" + 0.021*"responsibility" + 0.014*"master" + '
'0.014*"mind" + 0.014*"experience" + 0.014*"worker" + 0.014*"ability" + '
'0.007*"summary" + 0.007*"proposal" + 0.007*"alert"'),
(23,
'0.043*"banking" + 0.026*"origination" + 0.026*"round" + 0.026*"credibility" '
'+ 0.026*"entity" + 0.018*"standard" + 0.017*"range" + 0.017*"pension" + '
'0.017*"adapt" + 0.017*"information"'),
(13,
'0.034*"priority" + 0.034*"reconciliation" + 0.034*"purchaser" + '
'0.023*"reporting" + 0.023*"offer" + 0.023*"investor" + 0.023*"share" + '
'0.023*"region" + 0.023*"service" + 0.023*"manipulate"'),
(22,
'0.017*"analyst" + 0.017*"modelling" + 0.016*"producer" + 0.016*"return" + '
'0.016*"self" + 0.009*"scope" + 0.008*"mind" + 0.008*"need" + 0.008*"detail" '
'+ 0.008*"statistic"'),
(9,
'0.021*"decision" + 0.014*"invite" + 0.014*"balance" + 0.014*"commercialize" '
'+ 0.014*"transform" + 0.014*"manage" + 0.014*"optionality" + '
'0.014*"problem_solving" + 0.014*"fuel" + 0.014*"stay"'),
(7,
'0.032*"commitment" + 0.032*"study" + 0.016*"impact" + 0.016*"outlook" + '
'0.011*"operation" + 0.011*"expand" + 0.011*"exchange" + 0.011*"management" '
'+ 0.011*"conde" + 0.011*"evolve"'),
(15,
'0.032*"agility" + 0.019*"feasibility" + 0.019*"self" + 0.014*"deploy" + '
'0.014*"define" + 0.013*"investment" + 0.013*"option" + 0.013*"control" + '
'0.013*"action" + 0.013*"incubation"'),
(5,
'0.020*"desk" + 0.018*"agility" + 0.016*"vender" + 0.016*"coordinate" + '
'0.016*"committee" + 0.012*"acquisition" + 0.012*"target" + '
'0.012*"counterparty" + 0.012*"approval" + 0.012*"trend"'),
(17,
'0.022*"option" + 0.017*"working" + 0.017*"niche" + 0.011*"business" + '
'0.011*"constrain" + 0.011*"meeting" + 0.011*"correspond" + 0.011*"exposure" '
'+ 0.011*"element" + 0.011*"face"'),
(0,
'0.025*"expertise" + 0.025*"banking" + 0.021*"universitie" + '
'0.017*"spreadsheet" + 0.013*"negotiation" + 0.013*"shipment" + '
'0.013*"arise" + 0.013*"billing" + 0.013*"assistance" + 0.013*"sector"'),
(4,
'0.024*"provide" + 0.017*"consider" + 0.017*"allow" + 0.015*"outlook" + '
'0.015*"value" + 0.015*"contract" + 0.012*"study" + 0.012*"technology" + '
'0.012*"scenario" + 0.012*"indicator"'),
(6,
'0.058*"impulse" + 0.027*"shall" + 0.027*"shape" + 0.024*"marketer" + '
'0.017*"availability" + 0.014*"determine" + 0.014*"load" + '
'0.014*"constantly_change" + 0.014*"instrument" + 0.014*"interface"'),
(19,
'0.042*"task" + 0.038*"tariff" + 0.038*"recommend" + 0.024*"example" + '
'0.023*"future" + 0.021*"people" + 0.021*"math" + 0.021*"capacity" + '
'0.021*"spirit" + 0.020*"price"')]

The same model as above, but with 20 topics. As you can see, the output is ordered by topic number and contains every topic.

lda_model = gensim.models.ldamodel.LdaModel(corpus=jr_dict_corpus,
                                            id2word=jr_dict,
                                            num_topics=20,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)
pprint(lda_model.print_topics())
[(0,
'0.031*"enjoy" + 0.031*"ambiguity" + 0.028*"accordance" + 0.016*"statistic" '
'+ 0.016*"initiative" + 0.016*"service" + 0.016*"liberalize" + '
'0.016*"application" + 0.011*"community" + 0.011*"identifie"'),
(1,
'0.016*"transformation" + 0.016*"negotiation" + 0.016*"community" + '
'0.016*"clock" + 0.011*"marketer" + 0.011*"desk" + 0.011*"mandate" + '
'0.011*"closing" + 0.011*"initiative" + 0.011*"experience"'),
(2,
'0.026*"priority" + 0.026*"reconciliation" + 0.026*"purchaser" + '
'0.020*"safety" + 0.020*"region" + 0.020*"query" + 0.020*"share" + '
'0.020*"manipulate" + 0.020*"ibex" + 0.020*"investor"'),
(3,
'0.022*"improve" + 0.021*"committee" + 0.021*"affect" + 0.012*"target" + '
'0.012*"acquisition" + 0.011*"basis" + 0.011*"profitability" + '
'0.011*"economic" + 0.011*"natural" + 0.011*"profit"'),
(4,
'0.024*"provide" + 0.019*"value" + 0.017*"consider" + 0.017*"allow" + '
'0.015*"scenario" + 0.015*"outlook" + 0.015*"contract" + 0.014*"forecast" + '
'0.014*"decision" + 0.012*"indicator"'),
(5,
'0.037*"desk" + 0.030*"coordinate" + 0.030*"agility" + 0.030*"vender" + '
'0.023*"counterparty" + 0.023*"immature_emerge" + 0.023*"metric" + '
'0.022*"approval" + 0.015*"maximization" + 0.015*"undergraduate"'),
(6,
'0.053*"impulse" + 0.025*"shall" + 0.025*"shape" + 0.018*"availability" + '
'0.018*"marketer" + 0.012*"determine" + 0.012*"language" + '
'0.012*"monitoring" + 0.012*"integration" + 0.012*"month"'),
(7,
'0.026*"commitment" + 0.026*"study" + 0.013*"impact" + 0.013*"outlook" + '
'0.009*"operation" + 0.009*"management" + 0.009*"expand" + 0.009*"exchange" '
'+ 0.009*"conde" + 0.009*"balance"'),
(8,
'0.057*"insurance" + 0.029*"propose" + 0.028*"rule" + 0.026*"self" + '
'0.023*"product" + 0.023*"asset" + 0.023*"pricing" + 0.023*"amount" + '
'0.023*"result" + 0.020*"liquidity"'),
(9,
'0.012*"universitie" + 0.012*"need" + 0.012*"statistic" + 0.012*"trend" + '
'0.008*"invite" + 0.008*"commercialize" + 0.008*"transform" + 0.008*"manage" '
'+ 0.008*"problem_solving" + 0.008*"optionality"'),
(10,
'0.024*"background" + 0.024*"curve" + 0.020*"allow" + 0.019*"collect" + '
'0.019*"basis" + 0.017*"accordance" + 0.013*"improve" + 0.013*"datum" + '
'0.013*"component" + 0.013*"reliability"'),
(11,
'0.054*"task" + 0.049*"tariff" + 0.049*"recommend" + 0.031*"future" + '
'0.027*"spirit" + 0.027*"capacity" + 0.027*"math" + 0.022*"ensure" + '
'0.022*"profit" + 0.022*"variable_margin"'),
(12,
'0.001*"impulse" + 0.001*"availability" + 0.001*"reliability" + '
'0.001*"shall" + 0.001*"component" + 0.001*"agent" + 0.001*"marketer" + '
'0.001*"shape" + 0.001*"assisting" + 0.001*"supply"'),
(13,
'0.021*"region" + 0.016*"greenfield" + 0.016*"collegiate" + 0.011*"transfer" '
'+ 0.011*"remuneration" + 0.011*"organization" + 0.011*"structure" + '
'0.011*"continent" + 0.011*"project" + 0.011*"prepare"'),
(14,
'0.033*"originator" + 0.025*"vender" + 0.025*"expertise" + 0.025*"banking" + '
'0.019*"evolve" + 0.017*"management" + 0.017*"market" + 0.017*"site" + '
'0.012*"component" + 0.012*"discontinuing"'),
(15,
'0.027*"agility" + 0.022*"mind" + 0.022*"negotiation" + 0.011*"deploy" + '
'0.011*"define" + 0.011*"ecosystem" + 0.011*"control" + 0.011*"lead" + '
'0.011*"industry" + 0.011*"option"'),
(16,
'0.001*"region" + 0.001*"master" + 0.001*"orginiation" + 0.001*"greenfield" '
'+ 0.001*"agent" + 0.001*"identifie" + 0.001*"remuneration" + 0.001*"mark" + '
'0.001*"reviewing" + 0.001*"closing"'),
(17,
'0.030*"banking" + 0.018*"option" + 0.018*"round" + 0.018*"credibility" + '
'0.018*"origination" + 0.018*"entity" + 0.016*"working" + 0.015*"niche" + '
'0.015*"standard" + 0.012*"coordinate"'),
(18,
'0.027*"negotiation" + 0.018*"reporting" + 0.018*"perform" + 0.018*"world" + '
'0.015*"offer" + 0.015*"manipulate" + 0.011*"query" + 0.010*"control" + '
'0.010*"working" + 0.009*"self"'),
(19,
'0.047*"example" + 0.039*"people" + 0.039*"price" + 0.039*"excel" + '
'0.039*"excellent" + 0.038*"base" + 0.031*"office" + 0.031*"optimizing" + '
'0.031*"participate" + 0.031*"package"')]

The default number of topics for print_topics is 20. You must use the num_topics parameter to include more than 20 topics:

print(lda_model.print_topics(num_topics=25, num_words=10))
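To avoid hard-coding 25, one option is to read the topic count from the model itself and sort the result by topic id. The following is only a minimal sketch: the helper name `all_topics_in_order` is made up for illustration, and it assumes the model exposes a `num_topics` attribute and a `print_topics(num_topics, num_words)` method, as gensim's `LdaModel` does. `StubModel` is a hypothetical stand-in, not part of gensim; it only imitates the "at most 20 entries by default, not in id order" behaviour described in the question so that the snippet runs without a trained model.

```python
from pprint import pprint


def all_topics_in_order(model, num_words=10):
    """Return every topic of `model` as (topic_id, description), sorted by id.

    Assumption: `model` has a `num_topics` attribute and a
    `print_topics(num_topics, num_words)` method, like gensim's LdaModel.
    """
    topics = model.print_topics(num_topics=model.num_topics, num_words=num_words)
    # Sorting is a defensive step; it guarantees id order whatever the
    # underlying call returns.
    return sorted(topics, key=lambda pair: pair[0])


class StubModel:
    """Hypothetical stand-in for a trained model (not a gensim class)."""

    def __init__(self, num_topics):
        self.num_topics = num_topics

    def print_topics(self, num_topics=20, num_words=10):
        # Imitate the behaviour from the question: return at most
        # `num_topics` entries, deliberately not in id order.
        ids = list(range(self.num_topics))[::-1]
        return [(i, f"placeholder words for topic {i}") for i in ids[:num_topics]]


stub = StubModel(num_topics=25)
default_topics = stub.print_topics()   # default call is capped at 20 entries
topics = all_topics_in_order(stub)     # all 25 entries, ids 0..24
print(len(default_topics), len(topics))
pprint(topics[:3])
```

With the model from the question, the call would be `pprint(all_topics_in_order(lda_model))`.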
