Why are all my classification accuracy scores the same?



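I am running several machine learning models to find the one with the highest accuracy score, but all of the accuracy scores come out exactly the same. I ran NLP on social media text, and I am training my models to label sentiment based on the sentiment determined by NLTK.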

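I use the same training and test sets for every model, but I have used this approach before and got different scores for different models. Why are mine all the same? Am I overfitting?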

Here is my code for splitting and training:

submissions_sentiment = submissions_df[["Clean_Body", "Clean_Title", "sentiment_label"]]
dataset = submissions_sentiment
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1].values
X_arr = []
for index, row in X.iterrows():
    X_arr.append(row.values)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_arr, y, test_size = 0.2, random_state = 0)
def identity_tokenizer(text):
    return text
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(tokenizer=identity_tokenizer, lowercase=False)
# fit AND transform the model (only for training data)
X_train_vectors = vectorizer.fit_transform(X_train)
# transform the test data
X_test_vectors = vectorizer.transform(X_test)
# Linear SVM
from sklearn import svm
clf_svm = svm.SVC(kernel="linear")
clf_svm.fit(X_train_vectors, y_train)
clf_svm_pred = clf_svm.predict(X_test_vectors)
# Evaluate Model Accuracy
from sklearn.metrics import accuracy_score
accuracy_score(y_test, clf_svm_pred) 
# Output is .86
# Decision Tree
from sklearn.tree import DecisionTreeClassifier
clf_gnb = DecisionTreeClassifier()
clf_gnb.fit(X_train_vectors, y_train)
clf_gnb_pred = clf_gnb.predict(X_test_vectors)
# Evaluate Model Accuracy
accuracy_score(y_test, clf_gnb_pred)
# Output is .86

Here is an example of X_train:

# Review data output
print(X_train_vectors.toarray())
print(X_train[0])
print(X_train_vectors[0])
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
['I started really investing this year and looking for long term holdings After about 5 months or so I have decided to start putting money into ETFs for the time being while I research and learn about companies more For ETFs Im thinking about are the followingVOOQQQIm looking for another ETF that is not apart of Tech to kind of help diversify my holdings I was wondering if XLC would be a good third ETF My plan right now is each month put X amount into a single ETF then the next month put it into the next ETF etc and essentially continously put money into all three ETFs Im in my late 20s and my goal is to hold long term 10  15 years or longer If anyone has suggestions on other ETFs I would greatly appreciate it as Im trying to find the right ETFs to get into and hopefully grow over timeThank you in advance'
'What 3 ETFs are good to diversify with and buy into']
(0, 517)  1
(0, 1007) 1

The y_train value for this example is 1 (positive).

Here are y_test and the predictions from the SVM (linear kernel):

print(y_test)
[ 1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1 -1
1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1
1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1
1 -1  1  1  1  1  1 -1  1  1  1  1  1 -1 -1 -1  1 -1  1  1  1 -1  1 -1
1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1 -1  1  1  1  1  1  1 -1
1 -1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1 -1  1  1  1  1
1  1]
print(clf_svm_pred)
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

And so on. The decision tree output is the same.

Am I doing something wrong?

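I am not sure what is causing the problem, but since both the SVM model and the DecisionTreeClassifier always output 1, I suggest trying a more complex model such as RandomForestClassifier and seeing what happens.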
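A quick way to confirm that a model only ever predicts one class is to count the predicted labels and compare the accuracy with the share of the majority class in the test set. The sketch below uses made-up labels (not the question's data) that are imbalanced in a similar way; the names `y_test` and `y_pred` are only illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical test labels: mostly 1 with a few -1, as in the question
y_test = np.array([1, 1, -1, 1, 1, 1, -1, 1, 1, 1])
# Stand-in for a model that always predicts the majority class
y_pred = np.ones_like(y_test)

# Which classes does the model actually predict, and how often?
labels, counts = np.unique(y_pred, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))

# Accuracy versus the fraction of the test set that is simply class 1
acc = accuracy_score(y_test, y_pred)
majority_share = float(np.mean(y_test == 1))
print(acc, majority_share)

# Rows are true labels, columns are predicted labels, both ordered [-1, 1]
cm = confusion_matrix(y_test, y_pred, labels=[-1, 1])
print(cm)
```

If the accuracy equals the majority share and one column of the confusion matrix is all zeros, every model that collapses to "always 1" will report the same score, whatever its type.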

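I have run into something similar before: no matter how I tuned the training hyperparameters, the model always gave the same performance metrics. There are two likely causes: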

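  1. The data does not suit the model, for example every value in the feature vectors is zero: [0,0,0,0,0,0]
  2. The model is too simple and can only fit linear relationships, so it cannot learn a more complex mapping function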
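To illustrate the second cause, here is a minimal sketch on synthetic XOR-style data (an assumption for illustration, unrelated to the question's text features). A linear-kernel SVM cannot separate these classes, while a random forest usually can.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data: the label depends on the sign of x0 * x1,
# which no single straight line can separate.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Same split, two models of different capacity
linear_svm = SVC(kernel="linear").fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

linear_acc = accuracy_score(y_test, linear_svm.predict(X_test))
forest_acc = accuracy_score(y_test, forest.predict(X_test))
print(f"linear SVM: {linear_acc:.2f}, random forest: {forest_acc:.2f}")
```

Whether this applies to the question depends on the features. If the features themselves carry no usable signal, a more flexible model will not help either.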

Since your SVM is built with a linear kernel, could you try a more complex model and see what it produces? Could you also check whether X_train_vectors is all zeros?
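The zero check can be sketched as follows. The two rows are hypothetical stand-ins shaped like the question's X_train (a body string and a title string per row), and the vectorizer is set up the same way as in the question. It may also be worth printing the learned vocabulary, since with an identity tokenizer each row's elements are used as tokens unchanged.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def identity_tokenizer(text):
    # Returns the input unchanged, as in the question
    return text

# Hypothetical rows: [body, title]
X_train = [
    ["first post body", "first title"],
    ["second post body", "second title"],
]

vectorizer = CountVectorizer(tokenizer=identity_tokenizer, lowercase=False)
X_train_vectors = vectorizer.fit_transform(X_train)

# Count rows whose feature vector is entirely zero
row_sums = np.asarray(X_train_vectors.sum(axis=1)).ravel()
zero_rows = int((row_sums == 0).sum())
print(X_train_vectors.shape, "all-zero rows:", zero_rows)

# Inspect what the vectorizer treats as a "token"
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
```

In this toy input the vocabulary entries are whole strings rather than individual words, so each row gets exactly two non-zero features. That pattern would be consistent with the two entries printed for X_train_vectors[0] in the question, though it should be verified on the real data.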
