How to fix suspiciously perfect test scores in machine learning



I am really new to programming, especially machine learning. I am currently training models on my dataset using KNN, random forest, and decision tree as my algorithms. However, my accuracy, precision, recall, and F1 scores for random forest and decision tree are all 1.0, which suggests something is wrong. On the other hand, my KNN scores are low (accuracy: 0.892, recall: 0.452, precision: 0.824, F1 score: 0.584).

I have already cleaned my dataset, split it into training and test sets, and imputed missing values (with the median), so I am really confused about why the results look like this. What can I do to fix it?

Note: I don't really know how to ask questions here, so please let me know if I am missing any necessary information.

dataset image: https://i.stack.imgur.com/6FR1K.png
distribution of dataset: https://i.stack.imgur.com/1uZzN.png
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Convert 0's to NaN
columns = ["Age", "Race", "Marital Status", "T Stage", "N Stage",
           "6th Stage", "Grade", "A Stage", "Tumor Size",
           "Estrogen Status", "Progesterone Status",
           "Regional Node Examined", "Reginol Node Positive",
           "Survival Months", "Status"]
data[columns] = data[columns].replace({'0': np.nan, 0: np.nan})

# Impute missing values with the column median
imp_median = SimpleImputer(strategy='median')
imp_median.fit(data.values)
data_median = imp_median.transform(data.values)
data_median = pd.DataFrame(data_median, columns=columns)

# Scale the imputed data to [0, 1]
from sklearn.preprocessing import MinMaxScaler
minmaxScale = MinMaxScaler()
X = minmaxScale.fit_transform(data_median.values)
data_transformedDF = pd.DataFrame(X, columns=columns)
# Split the dataset into train and test sets
from sklearn.model_selection import train_test_split
features = data_transformedDF.drop(["Status"], axis=1)
outcome_variable = data_transformedDF["Status"]
x_train, x_test, y_train, y_test = train_test_split(
    features, outcome_variable, test_size=0.20, random_state=7)
# Cross-validation helper
from sklearn.model_selection import cross_validate

def cross_validation(model, _X, _y, _cv=10):
    '''
    Perform 10-fold cross-validation.

    Parameters
    model: estimator
        The machine learning algorithm to be used for training.
    _X: array
        The matrix of features (age, race, etc.).
    _y: array
        The target variable (1 - Dead, 0 - Alive).
    _cv: int, default=10
        The number of folds for cross-validation.

    Returns
    A dictionary containing the metrics 'accuracy', 'precision',
    'recall', and 'f1' for the training/validation set.
    '''
    _scoring = ['accuracy', 'precision', 'recall', 'f1']
    results = cross_validate(estimator=model,
                             X=_X,
                             y=_y,
                             cv=_cv,
                             scoring=_scoring,
                             return_train_score=True)
    return {"Training Accuracy scores": results['train_accuracy'],
            "Mean Training Accuracy": results['train_accuracy'].mean() * 100,
            "Mean Training Precision": results['train_precision'].mean(),
            "Mean Training Recall": results['train_recall'].mean(),
            "Mean Training F1 Score": results['train_f1'].mean()}
# KNN
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
cross_validation(knn, x_train, y_train, 10)

# Decision tree
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
cross_validation(dtc, x_train, y_train, 10)

# Random forest
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
cross_validation(rfc, x_train, y_train, 10)
# Test predictions for dtc
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             recall_score, precision_score, f1_score)
dtc_fitted = dtc.fit(x_train, y_train)
y_pred = dtc_fitted.predict(x_test)
print(confusion_matrix(y_test, y_pred))
print('Accuracy: %.3f' % accuracy_score(y_test, y_pred) +
      ' Recall: %.3f' % recall_score(y_test, y_pred) +
      ' Precision: %.3f' % precision_score(y_test, y_pred) +
      ' F1-score: %.3f' % f1_score(y_test, y_pred))

# Test predictions for rfc
rfc_fitted = rfc.fit(x_train, y_train)
y_pred = rfc_fitted.predict(x_test)
print(confusion_matrix(y_test, y_pred))
print('Accuracy: %.3f' % accuracy_score(y_test, y_pred) +
      ' Recall: %.3f' % recall_score(y_test, y_pred) +
      ' Precision: %.3f' % precision_score(y_test, y_pred) +
      ' F1-score: %.3f' % f1_score(y_test, y_pred))

# Test predictions for knn
knn_fitted = knn.fit(x_train, y_train)
y_pred = knn_fitted.predict(x_test)
print(confusion_matrix(y_test, y_pred))
print('Accuracy: %.3f' % accuracy_score(y_test, y_pred) +
      ' Recall: %.3f' % recall_score(y_test, y_pred) +
      ' Precision: %.3f' % precision_score(y_test, y_pred) +
      ' F1-score: %.3f' % f1_score(y_test, y_pred))
**For KNN**
'Mean Training Accuracy': 90.2971947134574,
'Mean Training Precision': 0.8457275536528337,
'Mean Training Recall': 0.44194341372912804,
'Mean Training F1 Score': 0.5804614758695162
Test predictions for knn:
Accuracy: 0.872 Recall: 0.323 Precision: 0.707 F1-score: 0.443

**For Decision Tree**
'Mean Training Accuracy': 100.0,
'Mean Training Precision': 1.0,
'Mean Training Recall': 1.0,
'Mean Training F1 Score': 1.0
Test predictions for dtc:
Accuracy: 0.850 Recall: 0.528 Precision: 0.523 F1-score: 0.525

**For Random Forest**
'Mean Training Accuracy': 99.99309630652398,
'Mean Training Precision': 1.0,
'Mean Training Recall': 0.9995454545454546,
Test predictions for rfc:
Accuracy: 0.896 Recall: 0.449 Precision: 0.803 F1-score: 0.576
# Oversample the training data with SMOTE
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_train_resampled, y_train_resampled = smote.fit_resample(x_train, y_train)

I ran knn, rfc, and dtc again after running the code for SMOTE, as sketched below.
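
For reference, a minimal sketch of that re-run, assuming the cross_validation helper, the classifier imports, and the resampled arrays from the snippets above are already in scope:

# Re-run 10-fold cross-validation on the SMOTE-resampled training data
for name, model in [("knn", KNeighborsClassifier()),
                    ("dtc", DecisionTreeClassifier()),
                    ("rfc", RandomForestClassifier())]:
    scores = cross_validation(model, X_train_resampled, y_train_resampled, 10)
    print(name, scores["Mean Training F1 Score"])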

This is probably not a technical problem with the code, but rather what is known as target leakage.

That is, one of the features in your model was recorded after your label occurred. For example, if you are predicting whether a patient will die or not and the data contains a survival-time field, most models can predict the outcome perfectly. In your dataset, "Survival Months" looks like exactly that kind of feature relative to "Status".
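
The usual fix is to drop the leaking feature before splitting and re-fit the models. A minimal sketch, assuming "Survival Months" is the leaking column and data_transformedDF from the question is already defined:

# Drop the feature recorded after the outcome, then re-split
# ("Survival Months" as the leaking column is an assumption here)
features = data_transformedDF.drop(["Status", "Survival Months"], axis=1)
outcome_variable = data_transformedDF["Status"]
x_train, x_test, y_train, y_test = train_test_split(
    features, outcome_variable, test_size=0.20, random_state=7)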

KNN is a bit different because it is a memorization model: it does not learn a relationship between the variables and the label. So if it has not seen an observation before, it will not give a perfect prediction even when target leakage is present.
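
One quick way to confirm a leak like this is to inspect which features a tree model relies on; a single feature dominating the importances is a red flag. A sketch, assuming the fitted rfc_fitted and the features DataFrame from the question:

# Rank features by random forest importance; a leaking feature
# typically dominates the list
import pandas as pd
importances = pd.Series(rfc_fitted.feature_importances_, index=features.columns)
print(importances.sort_values(ascending=False).head())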
