r - quanteda package, Naive Bayes: how can I predict on test data with different features?



Error in predict.textmodel_NB_fitted(model, test_dfm) : 
feature set in newdata different from that in training set

The code in the function that raises the error can be found on lines 157 to 165.



1. Is this error a property of the Naive Bayes algorithm, or a choice made by the function's author?


2. How can I work around it?


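library(quanteda)  # dfm(), stopwords() and the Naive Bayes text model
library(magrittr)  # provides the %>% pipe used below
train_text <- c("Can random effects apply only to categorical variables?",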
"ANOVA expectation identity",
"Statistical test for significance in ranking positions",
"Is Fisher Sharp Null Hypothesis testable?",
"List major reasons for different results from survival analysis among different studies",
"How do the tenses and aspects in English correspond temporally to one another?",
"Is there a correct gender-neutral singular pronoun (“his” vs. “her” vs. “their”)?",
"Are collective nouns always plural, or are certain ones singular?",
"What’s the rule for using “who” and “whom” correctly?",
"When is a gerund supposed to be preceded by a possessive adjective/determiner?")
train_class <- factor(c(rep(0,5), rep(1,5)))
train_dfm <- train_text %>% 
dfm(tolower=TRUE, stem=TRUE, remove=stopwords("english"))
model <- textmodel_NB(train_dfm, train_class)
test_text <- c("Weighted Linear Regression with Proportional Standard Deviations in R",
"What do significance tests for adjusted means tell us?",
"How should I punctuate around quotes?",
"Should I put a comma before the last item in a list?")
test_dfm <- test_text %>% 
dfm(tolower=TRUE, stem=TRUE, remove=stopwords("english"))
predict(model, test_dfm)


model_features <- model$data$x@Dimnames$features # gets the features of the training data
test_features <- test_dfm@Dimnames$features # gets the features of the test data
all_features <- c(model_features, test_features) %>% # combining the two sets of features...
subset(!duplicated(.)) # ...and getting rid of duplicate features
model$data$x@Dimnames$features <- test_dfm@Dimnames$features <- all_features # replacing features of model and test_dfm with all_features
predict(model, dfm) # new error?


Error in if (ncol(object$PcGw) != ncol(newdata)) stop("feature set in newdata different from that in training set") : 
argument is of length zero
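On the first question, it may help to look at how a textbook multinomial Naive Bayes classifier scores a document: it adds the log prior to the word counts multiplied by the log of the per-class word likelihoods. Those likelihoods are only estimated for features seen in training, so a feature that appears only in the test data has no parameter to multiply against. The sketch below is plain base R with made-up numbers, not quanteda's implementation; `nb_log_scores`, `word_probs`, `prior` and `counts` are all hypothetical names chosen for illustration.

```r
features <- c("test", "signific", "rule", "list")

# Made-up per-class word likelihoods P(w | c): one row per class,
# one column per training feature.
word_probs <- matrix(c(0.5, 0.3, 0.1, 0.1,
                       0.1, 0.1, 0.5, 0.3),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("0", "1"), features))
prior <- c("0" = 0.5, "1" = 0.5)

# Log-posterior (up to a constant) for each document and class.
# The matrix product only makes sense if the columns of `counts`
# are exactly the features the likelihoods were estimated for.
nb_log_scores <- function(counts, word_probs, prior) {
  stopifnot(identical(colnames(counts), colnames(word_probs)))
  scores <- counts %*% t(log(word_probs))
  sweep(scores, 2, log(prior), "+")
}

# Two toy documents expressed over the training features.
counts <- matrix(c(1, 0, 0, 1,
                   0, 0, 2, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("text1", "text2"), features))

scores <- nb_log_scores(counts, word_probs, prior)
predicted <- colnames(scores)[apply(scores, 1, which.max)]

# A document-feature matrix with an extra, unseen feature is rejected.
extra <- cbind(counts, comma = c(0, 1))
mismatch <- try(nb_log_scores(extra, word_probs, prior), silent = TRUE)
```

In this sketch the check on column names is a design choice: one could equally drop the unseen `comma` column silently, which is a common convention. Either way, the test features have to be brought into line with the training features before scoring.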


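Fortunately, there is an easy way to do this: you can use dfm_select() on your test data to give it the same features (and the same feature ordering) as the training set. It's as simple as: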

test_dfm <- dfm_select(test_dfm, train_dfm)
predict(model, test_dfm)
## Predicted textmodel of type: Naive Bayes
##             lp(0)       lp(1)     Pr(0)  Pr(1) Predicted
## text1  -0.6931472  -0.6931472    0.5000 0.5000         0
## text2 -11.8698712 -13.1879095    0.7889 0.2111         0
## text3  -4.1484118  -3.6635616    0.3811 0.6189         1
## text4  -8.0091415  -8.4257356    0.6027 0.3973         0
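Conceptually, matching a test matrix to the training features means three things: dropping columns the model never saw, adding all-zero columns for training features absent from the test data, and putting the columns in the training order. The base R sketch below illustrates that idea on a small dense count matrix. It is not quanteda's code, and `align_features` is a hypothetical helper name.

```r
# Hypothetical helper (not part of quanteda): conform a count matrix
# to a given feature set.
align_features <- function(counts, features) {
  # Start from an all-zero matrix with the target columns, in target order.
  aligned <- matrix(0, nrow = nrow(counts), ncol = length(features),
                    dimnames = list(rownames(counts), features))
  # Copy over the columns both sides share; unseen test features are dropped.
  shared <- intersect(features, colnames(counts))
  aligned[, shared] <- counts[, shared, drop = FALSE]
  aligned
}

train_features <- c("test", "signific", "rule", "list")

# Toy test counts: "comma" was never seen in training,
# while "signific" and "rule" are missing from the test data.
test_counts <- matrix(c(1, 0, 1,
                        0, 2, 1),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(c("text1", "text2"),
                                      c("list", "comma", "test")))

aligned <- align_features(test_counts, train_features)
colnames(aligned)  # same names and order as train_features
```

After alignment, a document that contained only unseen words becomes an all-zero row. That is consistent with text1 in the output above, where both classes end up with probability 0.5.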

As of May 2018, there now appears to be a "force = TRUE" option that will also do this for you:

predict(model, test_dfm, force = TRUE)
# text1 text2 text3 text4 
#    0     0     1     0 
# Levels: 0 1

Source: discussion between koheiw and kbenoit on the quanteda GitHub - https://github.com/quanteda/quanteda/issues/1329
