I want to compare the coherence scores of an LSA and an LDA model.

LSA model:

    lsa_model = TruncatedSVD(n_components=20, algorithm='randomized', n_iter=40, random_state=5000)
    lsa_top = lsa_model.fit_transform(vect_text)

LDA model:

    lda_model = LatentDirichletAllocation(n_components=20, learning_method='online', random_state=42, max_iter=1)

Could someone help me compute the coherence scores of these two models?

Thanks in advance!
I used sklearn's TfidfVectorizer combined with TruncatedSVD to find the optimal number of topics for my corpus. I couldn't find a built-in coherence measure for TruncatedSVD, so I had to implement one myself. The code is based on this article:

http://qpleple.com/topic-coherence-to-evaluate-topic-models/

I decided to stick with the intrinsic UMass measure, since it is relatively easy to implement. The supporting methods are:
    import math

    def get_umass_score(dt_matrix, i, j):
        # Binarize the document-term matrix: 1 where the term occurs in the document
        zo_matrix = (dt_matrix > 0).astype(int)
        col_i, col_j = zo_matrix[:, i], zo_matrix[:, j]
        # Indicator of documents containing both term i and term j
        col_ij = col_i + col_j
        col_ij = (col_ij == 2).astype(int)
        # Document counts; Di is assumed > 0 for the top words of a topic
        Di, Dij = col_i.sum(), col_ij.sum()
        return math.log((Dij + 1) / Di)
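As a sanity check, the pairwise score can be verified by hand on a made-up document-term matrix (this toy matrix is purely illustrative, not from my corpus): if term i occurs in 3 documents and both terms co-occur in 2, the score is log((2 + 1) / 3) = log(1) = 0.

```python
import math
import numpy as np

def get_umass_score(dt_matrix, i, j):
    # Binarize: 1 where the term occurs in the document
    zo_matrix = (dt_matrix > 0).astype(int)
    col_i, col_j = zo_matrix[:, i], zo_matrix[:, j]
    col_ij = ((col_i + col_j) == 2).astype(int)  # co-occurrence indicator
    Di, Dij = col_i.sum(), col_ij.sum()
    return math.log((Dij + 1) / Di)

# Hypothetical 4-document, 2-term count matrix
dt = np.array([[2, 1],
               [1, 1],
               [3, 0],
               [0, 0]])

# Term 0 appears in 3 docs; the two terms co-occur in 2 docs:
# log((2 + 1) / 3) = log(1) = 0
print(get_umass_score(dt, 0, 1))  # -> 0.0
```

Note the measure is asymmetric: `get_umass_score(dt, 1, 0)` divides by the document count of term 1 instead, giving log(3/2) ≈ 0.405.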
    def get_topic_coherence(dt_matrix, topic, n_top_words):
        # Pair each term weight with its column index, then take the top-weighted terms
        indexed_topic = zip(topic, range(0, len(topic)))
        topic_top = sorted(indexed_topic, key=lambda x: x[0], reverse=True)[0:n_top_words]
        coherence = 0
        # Sum the UMass score over every pair of top words with i_index < j_index
        for j_index in range(0, len(topic_top)):
            for i_index in range(0, j_index):
                i = topic_top[i_index][1]
                j = topic_top[j_index][1]
                coherence += get_umass_score(dt_matrix, i, j)
        return coherence
    def get_average_topic_coherence(dt_matrix, topics, n_top_words):
        total_coherence = 0
        for i in range(0, len(topics)):
            total_coherence += get_topic_coherence(dt_matrix, topics[i], n_top_words)
        return total_coherence / len(topics)
Usage:

    for n_topics in range(5, 1000, 50):
        svd = TruncatedSVD(n_components=n_topics, n_iter=7, random_state=42)
        svd.fit(tfidf_matrix)
        avg_coherence = get_average_topic_coherence(tfidf_matrix, svd.components_, 10)
        print(str(n_topics) + " " + str(avg_coherence))
Output:
5 -72.44478726897546
55 -86.18040144608892
105 -88.9175058514422
155 -90.3841147807378
205 -91.83948259181923
255 -92.01751480271953 < best
305 -90.73603639282118
355 -89.85740639388695
405 -89.41916273620417
455 -87.66472648614531
505 -85.06725618307024
555 -81.1419066684933
605 -77.03963739283286
655 -73.04509144854593
705 -69.84849596265884
755 -68.01357538571055
805 -67.48039395600706
855 -67.53091204608572
905 -67.23467504644942
955 -66.86079451952988
Lower UMass coherence is better. In my case, 255 topics fit my corpus best. I used the 10 words most relevant to each topic; you can use a different number. You will get different absolute values, but the optimal number of topics (SVD components) is usually the same.

I used TF-IDF vectors, but this coherence measure should work with any term-frequency-based vectorization (e.g. BOW).
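To switch to BOW, the only change is the vectorizer; the coherence helpers above consume the document-term matrix unchanged. A minimal sketch (the corpus and parameters are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy corpus; substitute your own documents
docs = [
    "topic models find latent structure in text",
    "latent semantic analysis uses svd on a term matrix",
    "count vectors work as well as tfidf for coherence",
    "svd components act as topics over the vocabulary",
]

# Bag-of-words counts instead of TF-IDF weights
bow_matrix = CountVectorizer().fit_transform(docs)

svd = TruncatedSVD(n_components=2, n_iter=7, random_state=42)
svd.fit(bow_matrix)

# The same coherence helpers then apply as before, e.g.:
# avg_coherence = get_average_topic_coherence(bow_matrix, svd.components_, 10)
print(svd.components_.shape)  # (n_topics, vocabulary size)
```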