R - Converting stm-processed documents to a DTM (structural topic modeling)



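I used the textProcessor and prepDocuments functions from the stm package to clean my corpus. Now I would like to convert the resulting object (a list of indices plus a vocabulary) into a standard document-term matrix (or a quanteda document-feature matrix), so that I can apply the LDA function from topicmodels and compare the resulting topics with those from stm.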

library(stm)          # textProcessor(), prepDocuments()
library(topicmodels)  # LDA()

processed <- textProcessor(poliblog5k.docs,
                           metadata = poliblog5k.meta,
                           language = "en")
prepped <- prepDocuments(processed$documents,
                         processed$vocab,
                         processed$meta,
                         lower.thresh = 20)

# Passing the stm objects straight to LDA() fails:
LDA(processed)
LDA(prepped)
> Error in x != vector(typeof(x), 1L)
LDA(processed$documents)
LDA(prepped$documents)
> Error in !all.equal(x$v, as.integer(x$v))

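I had the same problem. What I did was reshape the output of prepDocuments into a one-term-per-row format and then apply the cast_dtm function from the {tidytext} package (cast_dfm from the same package would give a quanteda document-feature matrix instead).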
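
To make the "one term per row" idea concrete, here is a minimal base-R sketch on made-up data. It assumes the layout stm uses for its documents: a list with one two-row integer matrix per document, where row 1 holds indices into the vocabulary and row 2 holds the counts. The toy `vocab` and `documents` objects are hypothetical stand-ins for `out$vocab` and `out$documents`, and `xtabs` is only used here to show the resulting shape as a small dense table.

```r
# Hypothetical toy data mimicking the stm layout (not produced by stm itself)
vocab <- c("econom", "immigr", "tax", "worri")
documents <- list(
  rbind(c(1L, 3L), c(2L, 1L)),         # doc 1: "econom" x2, "tax" x1
  rbind(c(2L, 3L, 4L), c(1L, 1L, 3L))  # doc 2: "immigr" x1, "tax" x1, "worri" x3
)

# Reshape to one row per (document, term) pair
long <- do.call(rbind, lapply(seq_along(documents), function(d) {
  m <- documents[[d]]
  data.frame(document = d,
             term = vocab[m[1, ]],  # look up the term behind each index
             n = m[2, ],            # matching counts
             stringsAsFactors = FALSE)
}))
print(long)

# Cross-tabulate into a small dense documents-by-terms table
dtm <- xtabs(n ~ document + term, data = long)
print(dtm)
```

A long table of this shape is what a casting function such as cast_dtm takes as input; for a real corpus a sparse result is preferable to the dense table shown here.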

library(topicmodels)
library(tidyverse)
library(tidytext)
library(magrittr)  # set_colnames()
library(stm)

# Convert the output of prepDocuments() into a DocumentTermMatrix
stm_to_dtm <- function(out){
  # each document is a 2-row matrix (term index, count); transpose to 2 columns
  tibble(out_doc = out$documents %>% map(t)) %>%
    mutate(out_doc = out_doc %>% map(set_colnames, c("term", "n"))) %>%
    mutate(out_doc = out_doc %>% map(as_tibble)) %>%
    # use the row position as the document id
    rownames_to_column(var = "document") %>%
    # one row per (document, term) pair
    unnest(cols = out_doc) %>%
    # replace vocabulary indices with the terms themselves
    mutate(term = out$vocab[term]) %>%
    cast_dtm(document, term, n)
}

temp <- textProcessor(documents = gadarian$open.ended.response, metadata = gadarian)
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents
out <- prepDocuments(docs, vocab, meta)
prepped <- stm_to_dtm(out)

> prepped
<<DocumentTermMatrix (documents: 341, terms: 462)>>
Non-/sparse entries: 3149/154393
Sparsity           : 98%
Maximal term length: 11
Weighting          : term frequency (tf)
> LDA(prepped, k = 5)
A LDA_VEM topic model with 5 topics.
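
A possible alternative that skips the reshaping step: stm ships a helper, convertCorpus(), that converts a documents/vocab pair into other formats, including a slam simple_triplet_matrix, and the topicmodels documentation describes LDA() as accepting input coercible to that class. The sketch below assumes an stm version that exports convertCorpus() and reuses the gadarian data from the answer; the names `stm_slam` and `lda_fit`, the choice k = 5, and the seed are arbitrary and only for illustration.

```r
library(stm)
library(topicmodels)

# Same preprocessing as in the answer above
temp <- textProcessor(documents = gadarian$open.ended.response, metadata = gadarian)
out <- prepDocuments(temp$documents, temp$vocab, temp$meta)

# Sparse documents-by-terms matrix in slam format
stm_slam <- convertCorpus(out$documents, out$vocab, type = "slam")
dim(stm_slam)

# Fit LDA on the converted matrix (assumes LDA() accepts it as-is)
lda_fit <- LDA(stm_slam, k = 5, control = list(seed = 1))
lda_fit
```

If a DocumentTermMatrix object is specifically needed (for example for other tm-based tooling), the tidytext route shown in the answer remains the more direct choice.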
