R - How to tune a text classifier created with RTextTools



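I am trying to create a text classifier using the RTextTools library in R. The training and testing data frames have the same format: two columns, where the first holds the text and the second holds the label.

A minimal reproducible example of my program so far (with substitute data):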

# Packages
## Install
install.packages(c('e1071', 'RTextTools'))
## Import
library(e1071)
library(RTextTools)
data.train <- data.frame("content" = c("Lorem Ipsum is simply dummy text of the printing and typesetting industry.", "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.", "It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged."), "label" = c("yes", "yes", "no"))
data.test <- data.frame("content" = c("It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout.", "The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English.", "Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy."), "label" = c("no", "yes", "yes"))
# Process training dataset
data.train.dtm <- create_matrix(data.train$content, language = "english", weighting = tm::weightTfIdf, removePunctuation = TRUE, removeNumbers = TRUE, removeSparseTerms = 0, removeStopwords = TRUE,  stemWords = TRUE, stripWhitespace = TRUE, toLower = TRUE)
data.train.container <- create_container(data.train.dtm, data.train$label, trainSize = 1:nrow(data.train), virgin = FALSE)
# Create linear SVM model
model.linear <- train_model(data.train.container, "SVM", kernel = "linear", cost = 10, gamma = 1^-2)
# Process testing dataset
data.test.dtm <- create_matrix(data.test$content, originalMatrix = data.train.dtm)
data.test.container <- create_container(data.test.dtm, labels = rep(0, nrow(data.test)), testSize = 1:nrow(data.test), virgin = FALSE)
# Classify testing dataset
model.linear.results <- classify_model(data.test.container, model.linear)
model.linear.results.table <- table(Predicted = model.linear.results$SVM_LABEL, Actual = data.test$label) 
model.linear.results.table

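The code I have so far works and produces a table comparing the predicted values with the actual values. The results are very inaccurate, and it is clear to me that the model needs tuning.

I know that the e1071 library (on which RTextTools is built) contains a tune.svm function that returns the cost and gamma values that produce the best results. The problem is that the data argument of tune.svm expects a data frame, but because I am building a text classifier, what I feed to the SVM is not a simple data frame but a document-term matrix.

I tried passing the DTM as a data frame, as shown below, to no avail: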

model.tuned <- tune.svm(label~., data = as.data.frame(data.train.dtm), gamma = 10^(-6:-1), cost = 10^(-1:1))

I am completely lost; any insight would be greatly appreciated.

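You can look at the code of train_model (press F2 in RStudio) to see how it calls svm() using the container (data.train.container in your case). By default, train_model uses

  • cross=0 (no cross-validation is performed on the training data)
  • cost=100 (the cost of constraint violation)
  • probability=TRUE (the model should allow probability predictions)
  • kernel="radial" (a radial kernel is used for SVM training)

as the arguments passed to svm().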
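To make those defaults concrete, here is a rough sketch of a direct svm() call with the same settings. It only uses e1071; the matrix x and factor y are made-up stand-ins for the container's training matrix and labels, so treat it as an illustration rather than what train_model literally runs.

```r
library(e1071)

# Hypothetical stand-in data (NOT from the question): 20 rows, 2 features
set.seed(1)
x <- matrix(rnorm(40), nrow = 20, ncol = 2)
y <- factor(rep(c("yes", "no"), each = 10))
x[y == "yes", ] <- x[y == "yes", ] + 2  # shift one class so it is separable

# The defaults listed above, spelled out explicitly
fit <- svm(x = x, y = y,
           kernel = "radial",   # radial kernel
           cost = 100,          # cost of constraint violation
           cross = 0,           # no cross-validation
           probability = TRUE)  # enable probability predictions

# Class predictions plus per-class probabilities
pred <- predict(fit, x, probability = TRUE)
head(attr(pred, "probabilities"))
```

Anything you want to change (kernel, cost, and so on) would then be a matter of changing the corresponding argument.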

To actually answer your question: the container returned by create_container() has the slots training_matrix and training_codes, which you can use as follows:

model.tuned <- tune.svm(x = data.train.container@training_matrix,
                        y = data.train.container@training_codes,
                        # add any other SVM params as needed here
                        gamma = 10^(-6:-1),
                        cost = 10^(-1:1))
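Once tune.svm has run, the returned object holds the winning parameter combination and a model refitted with it. Below is a small self-contained sketch of reading those out. It uses made-up numeric data in place of the container slots (x and y are hypothetical), so only the x/y interface and the result fields carry over to your case.

```r
library(e1071)

# Hypothetical stand-in data (NOT from the question): 60 rows, 2 features
set.seed(42)
x <- matrix(rnorm(120), nrow = 60, ncol = 2)
y <- factor(rep(c("yes", "no"), each = 30))
x[y == "yes", ] <- x[y == "yes", ] + 2  # shift one class so it is separable

# Grid search over gamma and cost using the matrix/label interface
tuned <- tune.svm(x = x, y = y,
                  gamma = 10^(-6:-1),
                  cost = 10^(-1:1))

tuned$best.parameters   # one-row data frame with the selected gamma and cost
tuned$best.performance  # cross-validated error of that combination

# best.model is an svm object fitted with the selected parameters
pred <- predict(tuned$best.model, x)
table(Predicted = pred, Actual = y)
```

Keep in mind that tune.svm evaluates each combination by cross-validation (10 folds by default, as far as I know, adjustable through tune.control), so it needs considerably more rows than the three-document example in the question.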
