r - How to create a dtm without losing rows

I am trying to run an LDA.

I have to convert my dfm into the proper format using this:

dtm <- convert(myDfm, to = "topicmodels")

However, I don't know why I lose 2-3 documents from my initial input.

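As a result, I cannot merge the topics back into the initial data frame.

I thought I could use the dfm directly, but it is not an accepted input format for lda().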

toks <- toks %>% tokens_wordstem()
myDfm <- dfm(toks, ngrams = 1)

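Unfortunately, I can't provide a sample input, as it has around 30,000 rows. If I test it with a small five-row example, the solution works fine.

Any suggestions?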

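The conversion is dropping empty "documents" from your dfm. Documents can become empty when their features are removed through frequency trimming or pattern matching (e.g. removing stopwords). LDA cannot handle empty documents, so by default empty documents are removed when converting to the LDA formats ("topicmodels", "stm", etc.).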
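To see which documents are affected in a large input, one option is to check the token counts before converting. The following is a minimal sketch, not taken from the original question: the named vector `txt_demo` and the document names `d1` to `d3` are made up for illustration, while `ntoken()` and `docnames()` are existing quanteda functions.

```r
library("quanteda")

# Illustrative input (an assumption): three tiny named documents
txt_demo <- c(d1 = "one two three", d2 = "and or but", d3 = "four five")

# Build a dfm after removing English stopwords
dfmat_demo <- dfm(tokens_remove(tokens(txt_demo), stopwords("en")))

# A document is empty when it has no tokens left
is_empty <- ntoken(dfmat_demo) == 0

# Names of the documents that convert() would drop by default
dropped_docs <- docnames(dfmat_demo)[is_empty]
print(dropped_docs)
```

Running the same check on the real dfm should list the 2-3 documents that go missing, assuming they are lost because they are empty.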

As of v1.5, however, convert() has an option called omit_empty = TRUE, which you can set to FALSE if you want to keep the zero-feature documents.

library("quanteda")
## Package version: 1.5.1
txt <- c("one two three", "and or but", "four five")
dfmat <- tokens(txt) %>%
    tokens_remove(stopwords("en")) %>%
    dfm()
dfmat
## Document-feature matrix of: 3 documents, 5 features (66.7% sparse).
## 3 x 5 sparse Matrix of class "dfm"
##        features
## docs    one two three four five
##   text1   1   1     1    0    0
##   text2   0   0     0    0    0
##   text3   0   0     0    1    1

Here is the difference that setting omit_empty = FALSE makes:

# with and without the empty documents
convert(dfmat, to = "topicmodels")
## <<DocumentTermMatrix (documents: 2, terms: 5)>>
## Non-/sparse entries: 5/5
## Sparsity           : 50%
## Maximal term length: 5
## Weighting          : term frequency (tf)
convert(dfmat, to = "topicmodels", omit_empty = FALSE)
## <<DocumentTermMatrix (documents: 3, terms: 5)>>
## Non-/sparse entries: 5/10
## Sparsity           : 67%
## Maximal term length: 5
## Weighting          : term frequency (tf)

Finally, if you want to subset the dfm to remove the empty documents, just use dfm_subset(). The second argument is coerced to a logical that is TRUE when ntoken(dfmat) > 0 and FALSE when it is 0.

# subset dfm to remove the empty documents
dfm_subset(dfmat, ntoken(dfmat))
## Document-feature matrix of: 2 documents, 5 features (50.0% sparse).
## 2 x 5 sparse Matrix of class "dfm"
##        features
## docs    one two three four five
##   text1   1   1     1    0    0
##   text3   0   0     0    1    1
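For the merging problem in the question, one approach is to match the model output back to the original data frame by document name, so that dropped documents get NA instead of shifting the rows. This is a base R sketch under assumptions: `docs`, its `doc_id` column, and the `topic_by_doc` vector are hypothetical stand-ins for the original data frame and for whatever per-document topic assignment the fitted model returns, assumed here to be named by document.

```r
# Hypothetical original data frame, one row per input document
docs <- data.frame(
  doc_id = c("text1", "text2", "text3"),
  text   = c("one two three", "and or but", "four five"),
  stringsAsFactors = FALSE
)

# Hypothetical model output: one topic per non-empty document,
# named by document (text2 is missing because it was empty)
topic_by_doc <- c(text1 = 2L, text3 = 1L)

# Align by document name; documents without a topic get NA
docs$topic <- unname(topic_by_doc[match(docs$doc_id, names(topic_by_doc))])
print(docs)
```

Matching on names rather than on row position keeps the data frame at its original length, which is why the empty document shows up with a missing topic rather than disappearing.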
