I have this data processing code:
library(text2vec)

## Perplexity on a hold-out set, for 3 to 25 topics
t1 <- Sys.time()
perplex <- c()
for (i in 3:25) {
  # Fit an LDA model with i topics on the training DTM
  set.seed(17)
  lda_model2 <- LDA$new(n_topics = i)
  doc_topic_distr2 <- lda_model2$fit_transform(x = dtm, progressbar = FALSE)

  # Build the DTM for the hold-out sample
  set.seed(17)
  sample.dtm2 <- itoken(rawsample$Abstract,
                        preprocessor = prep_fun,
                        tokenizer = tok_fun,
                        ids = rawsample$id,
                        progressbar = FALSE) %>%
    create_dtm(vectorizer, vtype = "dgTMatrix", progressbar = FALSE)

  # Infer topic distributions for the hold-out documents
  set.seed(17)
  new_doc_topic_distr2 <- lda_model2$transform(sample.dtm2, n_iter = 1000,
                                               convergence_tol = 0.001,
                                               n_check_convergence = 25,
                                               progressbar = FALSE)

  # Store perplexity at position i, so positions 1 and 2 stay NA
  perplex[i] <- text2vec::perplexity(sample.dtm2,
                                     topic_word_distribution = lda_model2$topic_word_distribution,
                                     doc_topic_distribution = new_doc_topic_distr2)
}
print(difftime(Sys.time(), t1, units = 'sec'))
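I know there are many questions like this, but I haven't been able to find an answer that fits my exact situation. Above you can see the perplexity calculation for Latent Dirichlet Allocation models with 3 to 25 topics. I want to pick the most adequate of those values, which means finding the elbow (or knee) of a series that can be treated as a plain numeric vector. The result looks like this: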
1 NA
2 NA
3 222.6229
4 210.3442
5 200.1335
6 190.3143
7 180.4195
8 174.2634
9 166.2670
10 159.7535
11 153.7785
12 148.1623
13 144.1554
14 141.8250
15 138.8301
16 134.4956
17 131.0745
18 128.8941
19 125.8468
20 123.8477
21 120.5155
22 118.4426
23 116.4619
24 113.2401
25 114.1233
plot(perplex)
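This is what the plot looks like.
I would say the elbow is at 13 or 16, but I'm not entirely sure, and I want an exact number as the result. I saw in this paper that f''(x) / (1 + f'(x)^2)^1.5 is the knee formula. I tried it, and it says 18: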
> d1 <- diff(perplex) # first derivative
> d2 <- diff(d1) / diff(perplex[-1]) # second derivative
> knee <- (d2)/((1+(d1)^2)^1.5)
Warning message:
In (d2)/((1 + (d1)^2)^1.5) :
longer object length is not a multiple of shorter object length
> which.min(knee)
[1] 18
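The warning above appears because `d1` has one more element than `d2`, so R recycles the shorter vector and the two derivative estimates no longer refer to the same topic counts. A minimal sketch of one way to align them follows. It assumes unit spacing between topic counts, drops the two leading `NA`s, and uses central differences, which are a substitution for the `diff()` calls above and not the paper's own procedure. The names `inner`, `curvature` and `knee_topics` are introduced here purely for illustration.

```r
# Perplexity for 3..25 topics, copied from the output above (leading NAs dropped)
perplex <- c(222.6229, 210.3442, 200.1335, 190.3143, 180.4195, 174.2634,
             166.2670, 159.7535, 153.7785, 148.1623, 144.1554, 141.8250,
             138.8301, 134.4956, 131.0745, 128.8941, 125.8468, 123.8477,
             120.5155, 118.4426, 116.4619, 113.2401, 114.1233)
topics <- 3:25
n <- length(perplex)
inner <- 2:(n - 1)  # interior points, where central differences exist

# Central finite differences with unit spacing; both vectors have length n - 2
d1 <- (perplex[inner + 1] - perplex[inner - 1]) / 2
d2 <- perplex[inner + 1] - 2 * perplex[inner] + perplex[inner - 1]

# Signed curvature f'' / (1 + f'^2)^1.5, labelled by topic count
curvature <- d2 / (1 + d1^2)^1.5
names(curvature) <- topics[inner]

# Assumption: for a decreasing curve that flattens out, the knee is taken
# as the point of largest positive curvature, hence which.max, not which.min
knee_topics <- topics[inner][which.max(curvature)]

print(round(curvature, 4))
print(knee_topics)
```

On these particular values the largest curvature falls at 24 topics, driven by the small uptick at 25, which suggests the raw formula is quite sensitive to noise at the end of the series. Smoothing the curve first, or restricting the range, may be worth trying before trusting the number.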
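I can't quite figure this out. Would anyone like to share how I could arrive at the exact ideal number of topics based on perplexity?
Found in this paper: "the LDA model with the optimal coherence score, obtained with an elbow method (the point with the maximum absolute second derivative) (…)", so this code does the job: d1 <- diff(perplex); k <- which.max(abs(diff(d1) / diff(perplex[-1])))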
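One detail that is easy to miss is that `which.max` returns a position in the differenced vector, not a topic count. A hedged sketch of the same elbow rule with an explicit mapping back to topic counts is shown here. It assumes the second derivative is taken with respect to the number of topics (unit spacing), so it leaves out the division by `diff(perplex[-1])` used in the one-liner, and `elbow_topics` is a name made up for this example.

```r
# Perplexity for 3..25 topics, copied from the question (leading NAs dropped)
perplex <- c(222.6229, 210.3442, 200.1335, 190.3143, 180.4195, 174.2634,
             166.2670, 159.7535, 153.7785, 148.1623, 144.1554, 141.8250,
             138.8301, 134.4956, 131.0745, 128.8941, 125.8468, 123.8477,
             120.5155, 118.4426, 116.4619, 113.2401, 114.1233)
topics <- 3:25

# Second difference with unit spacing:
# d2[j] = perplex[j + 2] - 2 * perplex[j + 1] + perplex[j], centred on topics[j + 1]
d2 <- diff(perplex, differences = 2)

# Position of the largest absolute second difference, mapped to a topic count
idx <- which.max(abs(d2))
elbow_topics <- topics[idx + 1]

print(elbow_topics)
```

With the original vector, where position i holds the perplexity for i topics, the same offset applies: the index `k` returned by the one-liner is centred on `k + 1` topics.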