LDA and topic modeling in R: topics, words, and probabilities



I am using the following code to run LDA and get the topics and the words associated with each topic.

keythemes <- function(x, stp = NULL){
        suppressPackageStartupMessages(library(lda))
        suppressPackageStartupMessages(library(tm))
        suppressPackageStartupMessages(library(stringr))
        x <- iconv(x$CONTENT, "WINDOWS-1252", "UTF-8")  # use the argument rather than a global 'a'; assumes x is a data frame with a CONTENT column
        myCorpus <- Corpus(VectorSource(x))   
        myCorpus <- tm_map(myCorpus, content_transformer(tolower), mc.cores = 1)
        myCorpus <- tm_map(myCorpus, removePunctuation, mc.cores = 1)
        myCorpus <- tm_map(myCorpus, removeNumbers, mc.cores = 1)
        myStopwords <- c(stopwords("english"), stp)
        myCorpus <- tm_map(myCorpus, removeWords, myStopwords, mc.cores = 1)
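        s <- tm_map(myCorpus, stemDocument, mc.cores = 1)  # note: s is overwritten on the next line, so the stemmed text is not used
        s <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(3, Inf)))  # keep terms of at least 3 characters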
        a.tdm.sp <- removeSparseTerms(s, sparse = 0.99)  
        suppressPackageStartupMessages(require(slam))
        a.tdm.sp.t <- t(a.tdm.sp) 
        term_tfidf <- tapply(a.tdm.sp.t$v/row_sums(a.tdm.sp.t)[a.tdm.sp.t$i], a.tdm.sp.t$j,mean) * log2(nDocs(a.tdm.sp.t)/col_sums(a.tdm.sp.t>0)) # calculate tf-idf values
        a.tdm.sp.t.tdif <- a.tdm.sp.t[,term_tfidf>=1.0] 
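        a.tdm.sp.t.tdif <- a.tdm.sp.t[row_sums(a.tdm.sp.t) > 0, ]  # drop empty documents; note: this starts again from a.tdm.sp.t, so the tf-idf filter above is not applied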
        suppressPackageStartupMessages(require(topicmodels))
        best.model <- lapply(seq(2, 3, by = 1), function(d){LDA(a.tdm.sp.t.tdif, d)}) 
        best.model.logLik <- as.data.frame(as.matrix(lapply(best.model, logLik)))  
        best.model.logLik.df <- data.frame(topics=c(2:3), LL = as.numeric(as.matrix(best.model.logLik)))
        best.model.logLik.df.sort <- best.model.logLik.df[order(-best.model.logLik.df$LL), ] 
        ntop <- best.model.logLik.df.sort[1,]$topics
        set.seed(375)
        layout(matrix(c(1, 2), nrow=2), heights=c(1, 6))
        par(mar=rep(0, 4))
        plot.new()
        text(x=0.5, y=0.5, "Key themes based on the key words chosen. \n Themes are populated using Latent Dirichlet Allocation.", cex = 1.2)
        lda <- LDA(a.tdm.sp.t.tdif, ntop) # fit an LDA model with the best-scoring number of topics
        a <- get_terms(lda, 5) # get keywords for each topic, just for a quick look
        a <- data.frame(a)
        suppressPackageStartupMessages(library(gridExtra))
        grid.table(a)
}     

How can I get the probability value of each word within each topic? The output I want looks like this:

Topic 1 Prob.Values  Topic 2 Prob.Values
offer       0.72      women       0.24 
amazon      0.01      shoes       0.06 
footwear    0.04      size        0.02 
flat        0.07      million     0.22

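Right now I only get the topics and the words. I tried exploring the gamma and beta values: lda@gamma gives the proportion of each document across the topics, while lda@beta gives a score for each word in each topic.

I am not sure whether the beta scores are actual probabilities or log-likelihood scores, because the values exceed 100 and many of them are negative. A reproducible sample of the data follows: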
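One way to check what the two slots hold is to fit a small model and look at their sums. The sketch below is only an illustration: it uses the AssociatedPress document-term matrix that ships with topicmodels as a stand-in for the data in this question, and lda_demo is a hypothetical name. It assumes beta stores natural logarithms, which is how the topicmodels documentation describes the slot.

```r
library(topicmodels)

# Stand-in data: a small slice of the document-term matrix bundled with topicmodels
data("AssociatedPress", package = "topicmodels")
lda_demo <- LDA(AssociatedPress[1:20, ], k = 2, control = list(seed = 375))

# gamma: one row per document, one column per topic; each row should sum to about 1
gamma_row_sums <- rowSums(lda_demo@gamma)

# beta: one row per topic, one column per term, stored as log-probabilities,
# so every value is <= 0 and exp() of each row should sum to about 1
beta_range <- range(lda_demo@beta)
beta_row_sums <- rowSums(exp(lda_demo@beta))

print(gamma_row_sums)
print(beta_range)
print(beta_row_sums)
```

If the exponentiated rows of beta sum to roughly 1, the negative values are log-probabilities rather than raw scores.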

structure(list(article_id = c(4.43047e+11, 4.45992e+11, 4.45928e+11, 
4.45692e+11, 4.4574e+11, 4.43754e+11), CONTENT = c("http://www.koovs.com/women/dresses/brand-koovs/sortby-price-low/ Coupon: DRESS50 Validi tii: 17th November Not valid on discounted products.", 
"Jabong has a lot to offer this winter season. So are you ready to click and pick on the all new winter store where all the products you choose are under the budget price of Rs 999 with massive discount of", 
"daughters (Sophia, Sistine and Scarlet) all wore beautiful dresses. 'GMA' Hot List: Jeff Bezos, Sylvester Stallone and a Puppy Party. More. Amazon's Jeff Bezos weights in on making space history and more in today's 60-second hot list. 1:10 | 11/24/15. Share. Title. Description. Share From. Share With. Facebook...", 
"Bags,Wallets and Belts -- AT, wildcrafts & more starting 134 only only on app Main link äóñ http://dl.flipkart.com/dl/bags-wallets-belts/pr... 134 only http://www.flipkart.com/grabbit-men-black-walle...", 
"not revert to a Techcircle.in query till the time of filing this report. Rajan has been the mobile business head of Flipkart-controlled lifestyle e-tailer Myntra since June last year. An alumnus of Delhi College of Engineering and IIM Ahmedabad, Rajan is also the co-founder of Easy2commute.com, a carpooling...", 
NA)), .Names = c("article_id", "CONTENT"), row.names = c(1299L, 
1710L, 1822L, 2371L, 2456L, 2896L), class = "data.frame")

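@beta holds the log word distribution of each topic. The logarithm is the natural log, so you can convert it to a plain probability distribution with exp():

Terms.Probability <- exp(t(lda@beta))

Terms.Probability now has one row per term and one column per topic. Every entry is a number between 0 and 1, and each column sums to 1.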
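To get a table shaped like the one requested in the question, the top terms of each topic can be paired with their probabilities. The following is a minimal sketch under stated assumptions, not a definitive implementation: it uses posterior() from topicmodels, which returns the term distributions already on the probability scale; the AssociatedPress data stands in for the question's corpus; and lda_demo and top_terms_with_probs are hypothetical names introduced for illustration.

```r
library(topicmodels)

# Stand-in data and model (the question's own 'lda' object could be used instead)
data("AssociatedPress", package = "topicmodels")
lda_demo <- LDA(AssociatedPress[1:20, ], k = 2, control = list(seed = 375))

# topics x terms matrix of probabilities, with terms as column names
term_probs <- posterior(lda_demo)$terms

# Hypothetical helper: for each topic, keep the n most probable terms
# and place them next to their probabilities
top_terms_with_probs <- function(probs, n = 5) {
  per_topic <- lapply(seq_len(nrow(probs)), function(k) {
    top <- sort(probs[k, ], decreasing = TRUE)[seq_len(n)]
    out <- data.frame(names(top), round(unname(top), 4),
                      stringsAsFactors = FALSE)
    names(out) <- c(paste("Topic", k), paste0("Prob.", k))
    out
  })
  do.call(cbind, per_topic)
}

result <- top_terms_with_probs(term_probs, n = 5)
print(result)
```

The resulting data frame has one term column and one probability column per topic, so it can be passed to grid.table() in place of the plain keyword table.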
