Edit: please share feedback in the comments, since I am still learning how to post good questions.
I am trying to train on this dataset with IsolationForest().
I need to train on this dataset, then use the model on a second dataset (whose quality values differ) to predict the quality and extract all wines with quality 8 and 9.
But I am running into a problem: the accuracy score in the classification report is 0.00:
print(classification_report(y_test, prediction))
              precision    recall  f1-score   support

          -1       0.00      0.00      0.00       0.0
           1       0.00      0.00      0.00       0.0
           3       0.00      0.00      0.00     866.0
           4       0.00      0.00      0.00     829.0
           5       0.00      0.00      0.00     841.0
           6       0.00      0.00      0.00     861.0
           7       0.00      0.00      0.00     822.0
           8       0.00      0.00      0.00     886.0
           9       0.00      0.00      0.00     851.0

    accuracy                           0.00    5956.0
   macro avg       0.00      0.00      0.00    5956.0
weighted avg       0.00      0.00      0.00    5956.0
I don't know whether this is a hyperparameter issue, whether I cleaned the data incorrectly, or whether I passed the wrong parameters. I have already tried it both with and without SMOTE, and I am hoping to reach at least 90% accuracy.
I am sharing a Drive link so the dataset can be checked:
https://drive.google.com/drive/folders/18_sOSIZZw9DCW7ftEKuOG4aIzGXoasFe?usp=sharing
Here is my code:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import IsolationForest
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report,confusion_matrix
df = pd.read_csv('wines.csv')
df.head(5)

# Encode the categorical 'color' column as integers
ordinalEncoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-99).fit(df[['color']])
df[['color']] = ordinalEncoder.transform(df[['color']])
df.info()
df['color'] = df['color'].astype(int)
df.head(3)

# Oversample the minority quality classes with SMOTE
stm = SMOTE(k_neighbors=4)
x_smote = df.drop('quality', axis=1)
y_smote = df['quality']
x_smote, y_smote = stm.fit_resample(x_smote, y_smote)
print(x_smote.shape, y_smote.shape)
x_smote.columns

# Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(x_smote)
y = y_smote
X.shape, y.shape

# Train/test split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

from sklearn.ensemble import IsolationForest
from sklearn.metrics import hamming_loss

# Fit an Isolation Forest and predict on the test set
iforest = IsolationForest(n_estimators=200, max_samples=0.1, contamination=0.10, max_features=1.0,
                          bootstrap=False, n_jobs=-1, random_state=None, verbose=0, warm_start=False)
iforest_fit = iforest.fit(x_train, y_train)
prediction = iforest_fit.predict(x_test)
print(prediction.shape, y_test.shape)
y.value_counts()
prediction
print(confusion_matrix(y_test, prediction))
hamming_loss(y_test, prediction)

from sklearn.metrics import classification_report
print(classification_report(y_test, prediction))
May I ask why you chose Isolation Forest as your model? The documentation points out that Isolation Forest is an unsupervised learning algorithm used for anomaly detection.
When I print a few prediction samples (from the Isolation Forest) alongside the actual ground-truth samples, I get the following output, which shows why the accuracy score is 0.0:
print(list(prediction[0:15]))
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(list(y_test[0:15]))
[9, 4, 4, 7, 9, 3, 6, 7, 4, 8, 8, 7, 3, 8, 5]
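IsolationForest.predict() can only ever return -1 (flagged as an outlier) or 1 (inlier), so it can never match quality labels in the 3-9 range. Here is a minimal, self-contained sketch on synthetic data (not your wines.csv) that shows this:

import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the wine features (illustration only, not wines.csv)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 5))

iso = IsolationForest(contamination=0.1, random_state=0).fit(X_demo)
pred = iso.predict(X_demo)

# The model only emits -1 (anomaly) and 1 (normal), never 3..9
print(np.unique(pred))  # [-1  1]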
The wines.csv dataset and your code both point to a multi-class classification problem, so I picked RandomForestClassifier() to continue with the second part of your code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import hamming_loss
model = RandomForestClassifier()
model.fit(x_train,y_train)
prediction = model.predict(x_test)
print(prediction[0:15]) #see 15 samples of prediction
[3, 9, 5, 5, 7, 9, 7, 6, 9, 8, 5, 9, 8, 3, 3]
print(list(y_test[0:15])) #see 15 samples of actual truth
[3, 9, 5, 6, 6, 9, 7, 5, 9, 8, 5, 9, 8, 3, 3]
print(confusion_matrix(y_test, prediction))
[[842 0 0 0 0 0 0]
[ 2 815 17 8 1 1 0]
[ 8 50 690 130 26 2 0]
[ 2 28 152 531 128 16 0]
[ 4 1 15 66 716 32 3]
[ 0 1 0 4 12 833 0]
[ 0 0 0 0 0 0 820]]
print('hamming_loss =', hamming_loss(y_test, prediction))
hamming_loss = 0.11903962390866353
print(classification_report(y_test, prediction))
              precision    recall  f1-score   support

           3       0.98      1.00      0.99       842
           4       0.91      0.97      0.94       844
           5       0.79      0.76      0.78       906
           6       0.72      0.62      0.67       857
           7       0.81      0.86      0.83       837
           8       0.94      0.98      0.96       850
           9       1.00      1.00      1.00       820

    accuracy                           0.88      5956
   macro avg       0.88      0.88      0.88      5956
weighted avg       0.88      0.88      0.88      5956
Even before any hyperparameter tuning, the accuracy is already 0.88.
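Since your original goal was to extract all wines predicted as quality 8 or 9 from the second dataset, that last step could look roughly like the sketch below. It continues from the code above (it assumes ordinalEncoder, scaler and model are still in scope), and the file name new_wines.csv is only a placeholder for your second dataset:

# Placeholder file name -- replace with your actual second dataset
new_df = pd.read_csv('new_wines.csv')

# Apply the same preprocessing that was fitted on the training data
new_df[['color']] = ordinalEncoder.transform(new_df[['color']])
features = new_df.drop(columns=['quality'], errors='ignore')
X_new = scaler.transform(features)

# Predict quality and keep only the wines predicted as 8 or 9
new_df['predicted_quality'] = model.predict(X_new)
best_wines = new_df[new_df['predicted_quality'].isin([8, 9])]
print(best_wines.head())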