I have a table (a data frame, myTable) with a single column, as shown below:
sentence
1 it is a window
2 My name is john doe
3 Thank you
4 Good luck
.
.
.
I want to convert it into a term-document matrix in R. I tried this:
tdm_s <- TermDocumentMatrix(Corpus(DataframeSource(myTable)))
But I get this error:
Error: all(!is.na(match(c("doc_id", "text"), names(x)))) is not TRUE
I searched on Google and could not find anything. How can I do this conversion?
You need to do the following to convert it to a term-document matrix:
## Your sample data
myTable <- data.frame(sentence = c("it is a window", "My name is john doe", "Thank you", "Good luck"))
## You need to use VectorSource before using Corpus
library(tm)
myCorpus <- Corpus(VectorSource(myTable$sentence))
tdm <- TermDocumentMatrix(myCorpus)
inspect(tdm)
#<<TermDocumentMatrix (terms: 8, documents: 4)>>
#Non-/sparse entries: 8/24
#Sparsity : 75%
#Maximal term length: 6
#Weighting : term frequency (tf)
#Sample :
# Docs
#Terms 1 2 3 4
#doe 0 1 0 0
#good 0 0 0 1
#john 0 1 0 0
#luck 0 0 0 1
#name 0 1 0 0
#thank 0 0 1 0
#window 1 0 0 0
#you 0 0 1 0
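The error message in the question checks for columns named doc_id and text, which suggests that DataframeSource only accepts a data frame laid out that way. If you would rather keep DataframeSource, a minimal sketch is to build such a data frame first. The name docs and the use of row numbers as document identifiers are assumptions made for illustration, not part of the original question.

```r
library(tm)

# Sample data from the question
myTable <- data.frame(
  sentence = c("it is a window", "My name is john doe", "Thank you", "Good luck"),
  stringsAsFactors = FALSE
)

# Assumption: row numbers are acceptable as document identifiers.
# DataframeSource looks for columns named "doc_id" and "text".
docs <- data.frame(
  doc_id = as.character(seq_len(nrow(myTable))),
  text = myTable$sentence,
  stringsAsFactors = FALSE
)

tdm_s <- TermDocumentMatrix(Corpus(DataframeSource(docs)))
inspect(tdm_s)
```

For a single text column, VectorSource as shown above is shorter; the DataframeSource route is mainly useful when you already have your own document identifiers.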
If you don't mind using the quanteda package (it is very good)...
require(quanteda)
# Your sample data
# Important to make sure the sentence variable is not converted to type factor
myTable <- data.frame(sentence = c("it is a window", "My name is john doe", "Thank you", "Good luck"),
stringsAsFactors = FALSE)
newcorpus <- corpus(myTable, text_field = "sentence") # you have to tell it the name of the text field
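# tokens() and dfm() have many options; see the help pages (?tokens, ?dfm).
# Recent quanteda versions (3.x and later) deprecate passing a corpus and the
# remove/stem arguments to dfm(), so those steps are applied to the tokens first.
newtoks <- tokens(newcorpus, remove_punct = TRUE)
newtoks <- tokens_wordstem(tokens_remove(newtoks, stopwords("english")))
newdfm <- dfm(newtoks)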
newdfm
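
A quanteda dfm stores documents in rows and features in columns, while the question asks for a term-document matrix. If you need the tm-style object, one possible sketch is to convert the dfm and transpose it. This assumes the tm package is installed alongside quanteda; the names toks, dtm and tdm_q are illustrative only.

```r
library(quanteda)

# Sample data from the question
myTable <- data.frame(
  sentence = c("it is a window", "My name is john doe", "Thank you", "Good luck"),
  stringsAsFactors = FALSE
)

# Tokenise the text column and build a document-feature matrix
toks <- tokens(myTable$sentence, remove_punct = TRUE)
newdfm <- dfm(toks)

# Assumption: tm is installed. convert() returns a tm DocumentTermMatrix,
# and transposing it gives terms in rows and documents in columns.
dtm <- convert(newdfm, to = "tm")
tdm_q <- t(dtm)
tdm_q
```

Unlike the tm defaults, this sketch keeps short words such as "it" and "is", because no stopword or length filtering is applied to the tokens.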