Is 0.7-0.75 an acceptable accuracy for Naive Bayes sentiment analysis?



I apologize in advance for posting so much code.

I am trying to classify YouTube comments into those that contain an opinion (whether positive or negative) and those that don't, using NLTK's Naive Bayes classifier, but no matter what I do in the preprocessing stage I can't really get the accuracy above 0.75. That seems a bit low compared with other examples I've seen; this tutorial, for instance, reports an accuracy of about 0.98.

Here is my full code:

import nltk, re, json, random
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tag import pos_tag
from nltk.tokenize import TweetTokenizer
from nltk import FreqDist, classify, NaiveBayesClassifier
from contractions import CONTRACTION_MAP
from abbreviations import abbrev_map
from tqdm.notebook import tqdm


def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    # Normalise curly apostrophes, then expand known abbreviations
    text = re.sub(r"’", "'", text)
    if text in abbrev_map:
        return abbrev_map[text]
    text = re.sub(r"\bluv", "lov", text)

    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())),
                                      flags=re.IGNORECASE | re.DOTALL)

    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = (contraction_mapping.get(match)
                                if contraction_mapping.get(match)
                                else contraction_mapping.get(match.lower()))
        expanded_contraction = first_char + expanded_contraction[1:]
        return expanded_contraction

    expanded_text = contractions_pattern.sub(expand_match, text)
    return expanded_text


def reduce_lengthening(text):
    # Collapse characters repeated 3+ times down to 2 (e.g. "soooo" -> "soo")
    pattern = re.compile(r"(.)\1{2,}")
    return pattern.sub(r"\1\1", text)


def lemmatize_sentence(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentence = []
    for word, tag in pos_tag(tokens):
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
    return lemmatized_sentence


def processor(comments_list):
    new_comments_list = []
    for com in tqdm(comments_list):
        com = com.lower()

        # expand out contractions
        tok = com.split(" ")
        z = []
        for w in tok:
            ex_w = expand_contractions(w)
            z.append(ex_w)
        st = " ".join(z)

        tokenized = tokenizer.tokenize(st)
        reduced = [reduce_lengthening(token) for token in tokenized]
        new_comments_list.append(reduced)

    lemmatized = [lemmatize_sentence(new_com) for new_com in new_comments_list]

    return lemmatized


def get_all_words(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token


def get_comments_for_model(cleaned_tokens_list):
    # Bag-of-words feature dicts of the form {token: True}
    for comment_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in comment_tokens)


if __name__ == "__main__":
    # =================================================================================
    tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)

    with open("english_lang/samples/training_set.json", "r", encoding="utf8") as f:
        train_data = json.load(f)

    pos_processed = processor(train_data['pos'])
    neg_processed = processor(train_data['neg'])
    neu_processed = processor(train_data['neu'])

    emotion = pos_processed + neg_processed
    random.shuffle(emotion)

    em_tokens_for_model = get_comments_for_model(emotion)
    neu_tokens_for_model = get_comments_for_model(neu_processed)
    em_dataset = [(comment_dict, "Emotion")
                  for comment_dict in em_tokens_for_model]
    neu_dataset = [(comment_dict, "Neutral")
                   for comment_dict in neu_tokens_for_model]
    dataset = em_dataset + neu_dataset

    random.shuffle(dataset)
    x = 700
    tr_data = dataset[:x]
    te_data = dataset[x:]
    classifier = NaiveBayesClassifier.train(tr_data)
    print(classify.accuracy(classifier, te_data))

I can post my training dataset if needed, but it is probably worth mentioning that the English in the YouTube comments themselves is of very poor quality and inconsistent (which I assume is why the model's accuracy is low). In any case, is this considered an acceptable level of accuracy? Or am I completely off track and there is a much better model to use, in which case feel free to tell me I'm an idiot! Thanks in advance.

Comparing your results against those of an unrelated tutorial is not statistically valid. Before you panic, do some proper research into the factors that can limit model accuracy. First of all, no model can be more accurate than the information inherent in the dataset allows. For example, no model can do better than 50% (in the long run) at predicting a random binary event, no matter what the dataset looks like.
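To make that concrete, here is a minimal sketch (the vocabulary, sample sizes and seed are invented purely for illustration, not taken from your data) that trains the same NLTK NaiveBayesClassifier on coin-flip labels; its test accuracy should hover around 0.5 no matter how the features are preprocessed:

import random
from nltk import classify, NaiveBayesClassifier

# Random binary labels carry no information, so no classifier can
# beat ~0.5 accuracy on them in the long run.
random.seed(0)
vocab = ["good", "bad", "video", "song", "love", "hate", "cool", "meh"]

def random_example():
    # Same {token: True} feature shape as in your script, but with a coin-flip label
    tokens = random.sample(vocab, k=4)
    features = {tok: True for tok in tokens}
    label = random.choice(["Emotion", "Neutral"])
    return features, label

data = [random_example() for _ in range(2000)]
train, test = data[:1500], data[500:]

clf = NaiveBayesClassifier.train(train)
print(classify.accuracy(clf, test))  # expect a value close to 0.5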

We have no reasonable way of evaluating the theoretical information content of your data here. If you want a sanity check, try applying some other model types to the same data and see what accuracy they achieve; running such experiments is a normal part of data science. A sketch of one way to do that follows.
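For example, a minimal sketch of that comparison (assuming scikit-learn is installed, and reusing the tr_data and te_data lists your script already builds) would wrap a few scikit-learn estimators in NLTK's SklearnClassifier and score them the same way:

from nltk import classify
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB

# tr_data / te_data are the same (feature-dict, label) lists built in the script above
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LinearSVC": LinearSVC(),
    "BernoulliNB": BernoulliNB(),
}

for name, estimator in models.items():
    clf = SklearnClassifier(estimator).train(tr_data)
    print(name, classify.accuracy(clf, te_data))

If several quite different model families all plateau around 0.75 on the same features, that is a strong hint the ceiling is in the data rather than in your choice of classifier.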
