Gensim topic modeling with Mallet: perplexity

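I am topic modeling Harvard Library book titles and subjects.

I use the Gensim Mallet wrapper to model with Mallet's LDA. When I try to get coherence and perplexity values to see how good the model is, the perplexity calculation fails with the warnings shown below. I do not get the same error if I use Gensim's built-in LDA model instead of Mallet. My corpus holds 7M short documents, averaging about 20 words each.

Below is the relevant part of my code: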

# TOPIC MODELING
import gensim
from gensim.models import CoherenceModel

num_topics = 50

# Build Gensim's LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=num_topics,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)

# Compute Perplexity
# a measure of how good the model is. lower the better.
print('\nPerplexity: ', lda_model.log_perplexity(corpus))

Perplexity:  -47.91929228302663

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model,
                                     texts=data_words_trigrams,
                                     dictionary=id2word,
                                     coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

Coherence Score:  0.28852857563541856

Gensim's LDA gave both scores without any problem. Now I model the same bag of words with Mallet:

# Building LDA Mallet Model
mallet_path = '~/mallet-2.0.8/bin/mallet'  # update this path
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path,
                                             corpus=corpus,
                                             num_topics=num_topics,
                                             id2word=id2word)

# Convert mallet to gensim type
mallet_model = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(ldamallet)

# Compute Coherence Score
coherence_model_ldamallet = CoherenceModel(model=mallet_model,
                                           texts=data_words_trigrams,
                                           dictionary=id2word,
                                           coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)

Coherence Score:  0.5994123896865993

Then I ask for the perplexity value and get the warnings and NaN value below.

# Compute Perplexity
print('\nPerplexity: ', mallet_model.log_perplexity(corpus))

/app/app-py3/lib/python3.5/site-packages/gensim/models/ldamodel.py:1108: RuntimeWarning: invalid value encountered in multiply
  score += np.sum((self.eta - _lambda) * Elogbeta)
Perplexity:  nan
/app/app-py3/lib/python3.5/site-packages/gensim/models/ldamodel.py:1109: RuntimeWarning: invalid value encountered in subtract
  score += np.sum(gammaln(_lambda) - gammaln(self.eta))

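I realize this is a very Gensim-specific question that requires deeper knowledge of this function: gensim.models.wrappers.ldamallet.malletmodel2ldamodel(ldamallet)

So I would appreciate any comments on the warnings and on the Gensim side of this.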

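I do not think the perplexity function is implemented for the Mallet wrapper. As mentioned in Radim's answer, Mallet prints the perplexity to stdout:

AFAIR, Mallet displays the perplexity to stdout. Would that be enough for you? Capturing these values programmatically should be possible too, but I haven't looked into that. Hopefully Mallet has some API call for perplexity evaluation too, but it's certainly not included in the wrapper.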
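As a rough illustration of capturing those values programmatically, here is a minimal sketch that pulls LL/token figures out of text already captured from Mallet's output. The helper name `extract_ll_per_token`, the regular expression, and the sample log lines are all assumptions made for illustration; the exact layout of Mallet's log lines may differ between versions, so check it against your own output.

```python
import re

# Assumed line shape: "... LL/token: -9.45493". Adjust if your Mallet
# version prints the value differently.
LL_TOKEN_RE = re.compile(r"LL/token:\s*(-?\d+(?:\.\d+)?)", re.IGNORECASE)


def extract_ll_per_token(log_text):
    """Return every LL/token value found in the captured log, in order."""
    return [float(value) for value in LL_TOKEN_RE.findall(log_text)]


# Hypothetical captured output, made up for illustration only.
sample_log = """<10> LL/token: -10.12345
<20> LL/token: -9.45493
"""

values = extract_ll_per_token(sample_log)
last_ll = values[-1]  # the most recent value reported during training
print(values, last_ll)
```

How the text gets captured (redirecting the Mallet process output to a file, for example) is left open here, since the wrapper does not expose it directly.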

I just ran it on a sample corpus, and the LL/token was indeed printed:

LL/token: -9.45493

perplexity = 2^(-LL/token) = 701.81
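The conversion above is a one-liner in Python. This sketch simply applies the formula quoted in this answer to the value printed for the sample corpus; the function name is made up for illustration.

```python
def perplexity_from_ll_per_token(ll_per_token):
    """Apply perplexity = 2 ** (-LL/token), as in the formula above."""
    return 2.0 ** (-ll_per_token)


perplexity = perplexity_from_ll_per_token(-9.45493)
print(round(perplexity, 2))  # about 701.81, matching the figure above
```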

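My two cents.

It seems that in lda_model.log_perplexity(corpus) you are using the same corpus you trained on. You might have better luck with a held-out portion of the corpus.
lda_model.log_perplexity(corpus) does not return the perplexity. It returns the "bound". If you want to turn it into perplexity, use np.exp2(-bound). I struggled with this for a while :)
There is no way to report perplexity using the Mallet wrapper, AFAIK.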
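To make the first two points concrete, here is a minimal sketch of holding out part of a bag-of-words corpus and converting the bound into a perplexity. The helpers `split_corpus` and `bound_to_perplexity`, the 20% hold-out fraction, and the toy corpus are assumptions for illustration; `2.0 ** (-bound)` is the plain-Python equivalent of `np.exp2(-bound)`.

```python
import random


def split_corpus(corpus, holdout_fraction=0.1, seed=100):
    """Randomly split a list of documents into (train, heldout) lists."""
    indices = list(range(len(corpus)))
    random.Random(seed).shuffle(indices)
    n_heldout = max(1, int(len(corpus) * holdout_fraction))
    heldout_idx = set(indices[:n_heldout])
    train = [doc for i, doc in enumerate(corpus) if i not in heldout_idx]
    heldout = [doc for i, doc in enumerate(corpus) if i in heldout_idx]
    return train, heldout


def bound_to_perplexity(bound):
    """Convert the per-word bound from log_perplexity() into a perplexity."""
    return 2.0 ** (-bound)


# Hypothetical toy corpus: each document is a list of (token_id, count) pairs.
toy_corpus = [[(i % 5, 1), ((i + 1) % 5, 2)] for i in range(20)]
train, heldout = split_corpus(toy_corpus, holdout_fraction=0.2)
print(len(train), len(heldout))

# With a real Gensim model (not run here), the idea would be:
#   lda_model = gensim.models.ldamodel.LdaModel(corpus=train, id2word=id2word,
#                                               num_topics=num_topics)
#   perplexity = bound_to_perplexity(lda_model.log_perplexity(heldout))
```

For a corpus of several million documents, a streamed split would probably be preferable to building two in-memory lists; the list-based version is only meant to show the idea.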
