# Load the libraries
library(tm)
library(e1071)
library(plyr)
# Insert the titles to be classified
sample = c(
"An Inductive Inference Machine",
"Computing Machinery and Intelligence",
"On the translation of languages from left to right",
"First Draft of a Report on the EDVAC",
"The Rendering Equation")
corpus <- Corpus(VectorSource(sample))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument, language = "english")
corpus <- tm_map(corpus, stripWhitespace)
dtm <- DocumentTermMatrix(corpus)
# Document-term matrix used as the training set
inspect(dtm)
# Declare the category labels
Category = c("Machine learning", "Artificial intelligence", "Compilers",
             "Computer architecture", "Computer graphics")
my.data = data.frame(as.matrix(dtm), Category)
my.data
sample = c(
"gprof: A Call Graph Execution Profiler",
"Architecture of the IBM System/360",
"A Case for Redundant Arrays of Inexpensive Disks (RAID)",
"Determining Optical Flow",
"A relational model for large shared data banks",
"Some complementarity problems of Z and Lyapunov-like transformations on Euclidean Jordan algebras")
corpus <- Corpus(VectorSource(sample))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument, language = "english")
corpus <- tm_map(corpus, stripWhitespace)
dtm1 <- DocumentTermMatrix(corpus)
# Document-term matrix used as the test set
inspect(dtm1)
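A side note from me (not part of the original question): if you intend to feed `dtm1` into a classifier trained on `dtm`, the two matrices must have identical columns. `tm` can build the test matrix over the training vocabulary via the `dictionary` control option:

```r
# Restrict the test matrix to the terms seen during training, so that
# the test columns line up with the training columns.
dtm1 <- DocumentTermMatrix(corpus, control = list(dictionary = Terms(dtm)))
inspect(dtm1)
```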
Well, your sample data has absolutely no overlapping terms, so there is not much you can do there. The tm
package does not assign meaning to words; it only measures how they co-occur. So you need to supply enough overlapping data that it has a chance of matching new input against the existing corpus.
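You can check the overlap directly. A quick sketch (assuming `dtm` and `dtm1` were built as in the question):

```r
# Terms shared between the training and test vocabularies;
# with the sample titles above there is essentially no overlap.
intersect(Terms(dtm), Terms(dtm1))
```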
Once you have real data, you have plenty of options for building a model. You could use a kNN classifier from the class
package, a decision tree from the rpart
package, or a neural network from the nnet
package. There are examples of each in this presentation. But it is up to you to decide what is right for your data. That part is not a programming question.
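To illustrate the kNN option mentioned above, here is a minimal sketch (my own code, not from the original answer; it rebuilds the test matrix over the training vocabulary so the column sets match):

```r
library(class)

# Rebuild the test matrix over the training vocabulary, so that both
# matrices have identical columns (required by knn()).
dtm.test <- DocumentTermMatrix(corpus, control = list(dictionary = Terms(dtm)))

# 1-nearest-neighbour classification of the test titles against the
# labelled training titles.
pred <- knn(train = as.matrix(dtm),
            test  = as.matrix(dtm.test),
            cl    = Category,
            k     = 1)
pred
```

With no term overlap between the two sets, the neighbours are essentially arbitrary, which is exactly the point of the answer above: the model can only work once the corpora share vocabulary.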