Isolation Forest gives an accuracy score of 0.0

Edit: please share your comments, as I am learning how to post good questions.

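I am trying to train IsolationForest() on this dataset. I need to train the model on it and then use the model on another dataset, in which the quality values have changed, to predict the quality and extract all wines with a quality of 8 or 9.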

However, I have a problem: the accuracy in the classification report is 0.0.

print(classification_report(y_test, prediction))
              precision    recall  f1-score   support
          -1       0.00      0.00      0.00       0.0
           1       0.00      0.00      0.00       0.0
           3       0.00      0.00      0.00     866.0
           4       0.00      0.00      0.00     829.0
           5       0.00      0.00      0.00     841.0
           6       0.00      0.00      0.00     861.0
           7       0.00      0.00      0.00     822.0
           8       0.00      0.00      0.00     886.0
           9       0.00      0.00      0.00     851.0
    accuracy                           0.00    5956.0
   macro avg       0.00      0.00      0.00    5956.0
weighted avg       0.00      0.00      0.00    5956.0

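I don't know whether this is a hyperparameter problem, whether I cleaned the data incorrectly, or whether I passed the wrong parameters. I have tried both with and without SMOTE. I would like to reach at least 90% accuracy.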

I am leaving a link to the shared drive so the dataset can be checked:

https://drive.google.com/drive/folders/18_sOSIZZw9DCW7ftEKuOG4aIzGXoasFe?usp=sharing

Here is my code:

import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report, confusion_matrix, hamming_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Load the data and encode the categorical 'color' column as integers
df = pd.read_csv('wines.csv')
df.head(5)
ordinalEncoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-99).fit(df[['color']])
df[['color']] = ordinalEncoder.transform(df[['color']])
df.info()
df['color'] = df['color'].astype(int)
df.head(3)

# Oversample the minority quality classes with SMOTE
stm = SMOTE(k_neighbors=4)
x_smote = df.drop('quality', axis=1)
y_smote = df['quality']
x_smote, y_smote = stm.fit_resample(x_smote, y_smote)
print(x_smote.shape, y_smote.shape)
x_smote.columns

# Standardize the features and split into train and test sets
scaler = StandardScaler()
X = scaler.fit_transform(x_smote)
y = y_smote
X.shape, y.shape
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Fit the isolation forest and predict on the test split
iforest = IsolationForest(n_estimators=200, max_samples=0.1, contamination=0.10,
                          max_features=1.0, bootstrap=False, n_jobs=-1,
                          random_state=None, verbose=0, warm_start=False)
iforest_fit = iforest.fit(x_train, y_train)
prediction = iforest_fit.predict(x_test)
print(prediction.shape, y_test.shape)
y.value_counts()
prediction

# Evaluate the predictions against the quality labels
print(confusion_matrix(y_test, prediction))
hamming_loss(y_test, prediction)
print(classification_report(y_test, prediction))

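May I ask why you chose Isolation Forest as your model? This article points out that Isolation Forest is an unsupervised learning algorithm for anomaly detection.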
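To make "unsupervised" concrete, here is a minimal sketch on synthetic data. The arrays, the `random_state=0` seed and the variable names are assumptions for illustration and are not taken from the question. In scikit-learn, the `y` argument of `IsolationForest.fit` exists only for API consistency, so fitting with or without labels should give the same model when the seed is fixed:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))        # synthetic features (assumption)
y = rng.integers(3, 10, size=300)    # synthetic "quality" labels 3..9 (assumption)

# Same seed, one model fitted with labels and one without
with_labels = IsolationForest(n_estimators=50, random_state=0).fit(X, y)
without_labels = IsolationForest(n_estimators=50, random_state=0).fit(X)

# If y were used, the anomaly scores would differ between the two models
same_scores = np.allclose(with_labels.score_samples(X),
                          without_labels.score_samples(X))
same_predictions = np.array_equal(with_labels.predict(X),
                                  without_labels.predict(X))
print(same_scores, same_predictions)
```

In other words, passing `y_train` to `fit` does not teach the model anything about wine quality.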

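When I print a few predictions (from Isolation Forest) next to the corresponding ground-truth values, I get the following, which shows why the accuracy is 0.0. IsolationForest.predict() only returns 1 (inlier) or -1 (outlier), so it can never match quality labels from 3 to 9: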

print(list(prediction[0:15]))
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(list(y_test[0:15]))
[9, 4, 4, 7, 9, 3, 6, 7, 4, 8, 8, 7, 3, 8, 5]
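The same effect can be reproduced without the wine data. The sketch below uses synthetic features and labels (the shapes, seed and names are assumptions for illustration) and checks which values `predict` returns and what accuracy results when they are compared with labels in the range 3 to 9:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 4))        # synthetic features (assumption)
y_demo = rng.integers(3, 10, size=500)    # synthetic quality labels 3..9 (assumption)

forest = IsolationForest(n_estimators=50, contamination=0.10, random_state=42)
pred_demo = forest.fit(X_demo).predict(X_demo)

# predict() yields inlier/outlier flags, not class labels
labels_returned = set(np.unique(pred_demo).tolist())
acc = accuracy_score(y_demo, pred_demo)
print(labels_returned)
print(acc)
```

This also explains the extra -1 and 1 rows with a support of 0 in the classification report: they appear in the predictions but never in y_test.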

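Both the wines.csv dataset and your code point to a multi-class classification problem. I chose RandomForestClassifier() to continue from the second part of your code: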

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import hamming_loss
model = RandomForestClassifier()
model.fit(x_train, y_train)
prediction = model.predict(x_test)
print(prediction[0:15])    # see 15 samples of prediction
[3, 9, 5, 5, 7, 9, 7, 6, 9, 8, 5, 9, 8, 3, 3]
print(list(y_test[0:15]))    # see 15 samples of actual truth
[3, 9, 5, 6, 6, 9, 7, 5, 9, 8, 5, 9, 8, 3, 3]
print(confusion_matrix(y_test, prediction))
[[842   0   0   0   0   0   0]
 [  2 815  17   8   1   1   0]
 [  8  50 690 130  26   2   0]
 [  2  28 152 531 128  16   0]
 [  4   1  15  66 716  32   3]
 [  0   1   0   4  12 833   0]
 [  0   0   0   0   0   0 820]]
print('hamming_loss =', hamming_loss(y_test, prediction))
hamming_loss = 0.11903962390866353
print(classification_report(y_test, prediction))
              precision    recall  f1-score   support
           3       0.98      1.00      0.99       842
           4       0.91      0.97      0.94       844
           5       0.79      0.76      0.78       906
           6       0.72      0.62      0.67       857
           7       0.81      0.86      0.83       837
           8       0.94      0.98      0.96       850
           9       1.00      1.00      1.00       820
    accuracy                           0.88      5956
   macro avg       0.88      0.88      0.88      5956
weighted avg       0.88      0.88      0.88      5956
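As a cross-check, the accuracy and the Hamming loss can be recomputed by hand from the confusion matrix printed above. The sketch below copies that matrix into a NumPy array (the variable names are my own) and relies on the fact that, for single-label multi-class problems, the Hamming loss equals 1 minus the accuracy:

```python
import numpy as np

# Confusion matrix copied from the RandomForestClassifier output above
# (rows: true quality 3..9, columns: predicted quality 3..9)
cm = np.array([
    [842,   0,   0,   0,   0,   0,   0],
    [  2, 815,  17,   8,   1,   1,   0],
    [  8,  50, 690, 130,  26,   2,   0],
    [  2,  28, 152, 531, 128,  16,   0],
    [  4,   1,  15,  66, 716,  32,   3],
    [  0,   1,   0,   4,  12, 833,   0],
    [  0,   0,   0,   0,   0,   0, 820],
])

total = int(cm.sum())             # number of test samples
correct = int(np.trace(cm))       # diagonal = correctly classified samples
accuracy = correct / total
hamming = 1 - accuracy            # fraction of misclassified samples
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print(total, correct)
print(round(accuracy, 4), round(hamming, 4))
print(np.round(recall_per_class, 2))
```

The diagonal holds 5247 of the 5956 samples, which gives an accuracy of about 0.881 and a Hamming loss of about 0.119, in line with the report.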

Even before any hyperparameter tuning, the accuracy is already 0.88.
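For the second half of your goal, extracting the wines with a quality of 8 or 9 from another dataset, one possible approach is to predict the quality with the trained classifier and keep the rows whose prediction is 8 or 9. The sketch below is only an illustration on synthetic data: `X_train_demo`, `X_new` and the single informative feature are hypothetical stand-ins, not your actual columns.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the training data (assumption): the first feature
# tracks quality, the second is pure noise
quality_train = rng.integers(3, 10, size=700)
X_train_demo = np.column_stack([
    quality_train + rng.normal(scale=0.3, size=700),
    rng.normal(size=700),
])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train_demo, quality_train)

# Hypothetical new dataset whose quality is unknown at prediction time
hidden_quality = rng.integers(3, 10, size=200)
X_new = np.column_stack([
    hidden_quality + rng.normal(scale=0.3, size=200),
    rng.normal(size=200),
])

# Predict quality, then keep only the rows predicted as 8 or 9
predicted_quality = clf.predict(X_new)
top_mask = np.isin(predicted_quality, [8, 9])
top_wines = X_new[top_mask]
print(top_wines.shape)
```

With your real data, the new dataset would need the same preprocessing as the training data, for example the already fitted encoder and scaler applied with transform() rather than fit_transform().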
