Very low accuracy from a Naive Bayes classifier for movie reviews, despite several attempts at feature selection



I'm fairly new to machine learning and have been tasked with building a model that predicts whether a review is good (1) or bad (0). I first tried a RandomForestClassifier, which gave 50% accuracy. I then switched to a Naive Bayes classifier, but even after a grid search I got no improvement.
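A rough sketch of what such a grid search might look like over a TfidfVectorizer + MultinomialNB pipeline (the parameter grid below is illustrative, not the exact one I used):

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Illustrative pipeline and parameter grid
pipe = Pipeline([('tfidf', TfidfVectorizer()), ('nb', MultinomialNB())])
grid = {'tfidf__max_features': [2000, 5000, 10000],
        'tfidf__ngram_range': [(1, 1), (1, 2)],
        'nb__alpha': [0.1, 0.5, 1.0]}
search = GridSearchCV(pipe, grid, cv=5, scoring='accuracy')
search.fit(all_train_set['Reviews'], all_train_set['Labels'])
print(search.best_params_, search.best_score_)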

My data looks like this (I'm happy to share the data with anyone):

Reviews  Labels
0      For fans of Chris Farley, this is probably his...       1
1      Fantastic, Madonna at her finest, the film is ...       1
2      From a perspective that it is possible to make...       1
3      What is often neglected about Harold Lloyd is ...       1
4      You'll either love or hate movies such as this...       1
...     ...
14995  This is perhaps the worst movie I have ever se...       0
14996  I was so looking forward to seeing this film t...       0
14997  It pains me to see an awesome movie turn into ...       0
14998  "Grande Ecole" is not an artful exploration of...       0
14999  I felt like I was watching an example of how n...       0
[15000 rows x 2 columns]

Before training the classifier, I preprocess the text and vectorize it with TfidfVectorizer. The code looks like this:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Fit TF-IDF on the training reviews (top 5,000 terms) and train the classifier
vect = TfidfVectorizer(stop_words=stopwords, max_features=5000)
X_train = vect.fit_transform(all_train_set['Reviews'])
y_train = all_train_set['Labels']
clf = MultinomialNB()
clf.fit(X_train, y_train)

# Transform the test set with the already-fitted vectorizer and evaluate
X_test = vect.transform(all_test_set['Reviews'])
y_test = all_test_set['Labels']
print(classification_report(y_test, clf.predict(X_test), digits=4))

The classification report seems to show that one label is predicted reasonably well while the other is predicted very poorly, which drags the overall score down.

              precision    recall  f1-score   support

           0     0.5000    0.8546    0.6309      2482
           1     0.5000    0.1454    0.2253      2482

    accuracy                         0.5000      4964
   macro avg     0.5000    0.5000    0.4281      4964
weighted avg     0.5000    0.5000    0.4281      4964

I've tried following 8 different tutorials and coding this in every different way, but I can't seem to get above 50%, which makes me think the problem may be with my features.

If anyone has any ideas or suggestions, I would be very grateful.

Edit: OK, I've now added some preprocessing steps, including removing HTML tags, removing punctuation and single characters, and collapsing multiple spaces, with the code below:

import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

def preprocess_text(sen):
    # Removing html tags
    sentence = remove_tags(sen)
    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)
    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)
    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)
    return sentence
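
A quick sketch of how this cleaning step could be applied to both splits before vectorizing (assuming the same all_train_set / all_test_set DataFrames as above):

# Apply the cleaning function to the raw review text in both splits
all_train_set['Reviews'] = all_train_set['Reviews'].apply(preprocess_text)
all_test_set['Reviews'] = all_test_set['Reviews'].apply(preprocess_text)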

I believe TfidfVectorizer automatically lowercases everything; lemmatization, as far as I can tell, is not built in. The final result is still only 0.5.
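
In case it's useful, a rough sketch of plugging a lemmatizer into TfidfVectorizer (lowercasing is on by default; the LemmaTokenizer class here is illustrative and assumes NLTK is available):

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# nltk.download('punkt'); nltk.download('wordnet')  # one-time data downloads

class LemmaTokenizer:
    # Tokenize each review and lemmatize every token before TF-IDF weighting
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.lemmatizer.lemmatize(tok) for tok in word_tokenize(doc)]

vect = TfidfVectorizer(stop_words=stopwords, max_features=5000,
                       lowercase=True,              # already the default
                       tokenizer=LemmaTokenizer())  # lemmatization has to be supplied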

Text preprocessing is very important here. Removing stopwords alone is not enough; I think you should also consider the following:

  • Converting the text to lowercase
  • Removing punctuation
  • Expanding contractions ("'ll" -> " will", "'ve" -> " have")
  • Removing numbers
  • Lemmatization and/or stemming of the reviews
  • etc.

Take a look at text preprocessing methods; a rough sketch of these steps is below.
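
A minimal sketch of those steps, assuming NLTK's WordNetLemmatizer is available (the contraction map is illustrative and would need to be extended):

import re
from nltk.stem import WordNetLemmatizer

# Illustrative contraction map; extend with whatever forms appear in the reviews
CONTRACTIONS = {"'ll": " will", "'ve": " have", "'re": " are", "n't": " not"}
lemmatizer = WordNetLemmatizer()

def clean_review(text):
    text = text.lower()                                        # lowercase
    for short, full in CONTRACTIONS.items():                   # expand contractions
        text = text.replace(short, full)
    text = re.sub(r'[^a-z\s]', ' ', text)                      # drop punctuation and numbers
    tokens = [lemmatizer.lemmatize(t) for t in text.split()]   # lemmatize
    return ' '.join(tokens)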
