Why does StratifiedShuffleSplit give the same result for every split of the dataset?

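I am using StratifiedShuffleSplit to repeat the process of splitting the dataset, fitting, predicting, and computing metrics. Can you explain why every split gives the same result?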

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import classification_report

clf = RandomForestClassifier(max_depth=5)
df = pd.read_csv("https://raw.githubusercontent.com/leanhdung1994/BigData/main/cll_dataset.csv")
# Target is the first column; features are all remaining columns
X, y = df.iloc[:, 1:], df.iloc[:, 0]
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0).split(X, y)
for train_ind, test_ind in sss:
    # split() yields positional indices, so select rows with iloc
    X_train, X_test = X.iloc[train_ind], X.iloc[test_ind]
    y_train, y_test = y.iloc[train_ind], y.iloc[test_ind]
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    report = classification_report(y_test, y_pred, zero_division=0, output_dict=True)
    report = pd.DataFrame(report).T
    report = report[:2]  # keep only the rows for classes 0 and 1
    print(report)

Result

   precision  recall  f1-score  support
0       0.75     1.0  0.857143      6.0
1       0.00     0.0  0.000000      2.0
   precision  recall  f1-score  support
0       0.75     1.0  0.857143      6.0
1       0.00     0.0  0.000000      2.0
   precision  recall  f1-score  support
0       0.75     1.0  0.857143      6.0
1       0.00     0.0  0.000000      2.0
   precision  recall  f1-score  support
0       0.75     1.0  0.857143      6.0
1       0.00     0.0  0.000000      2.0
   precision  recall  f1-score  support
0       0.75     1.0  0.857143      6.0
1       0.00     0.0  0.000000      2.0

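Every model you build predicts class 0 for every sample. Because the split is stratified, each test set has the same proportion of class 0 and class 1 as the full dataset (here, 6 samples of class 0 and 2 of class 1), so the predictions score exactly the same on every split.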
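To make the arithmetic concrete, here is a small sketch in plain Python that recomputes the per-class scores for a predictor that always answers 0. The class counts come from the support column of the report above; the helper name `per_class_scores` is made up for this illustration and is not part of scikit-learn.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class; undefined ratios become 0.0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Test-set composition taken from the "support" column: 6 zeros, 2 ones.
y_true = [0] * 6 + [1] * 2
# A predictor that ignores its input and always answers class 0.
y_pred = [0] * len(y_true)

scores_0 = per_class_scores(y_true, y_pred, 0)
scores_1 = per_class_scores(y_true, y_pred, 1)
print("class 0:", scores_0)
print("class 1:", scores_1)
```

For class 0 this gives precision 6/8 = 0.75 and recall 6/6 = 1.0, hence an F1 of about 0.857. For class 1 all three scores are 0. Any stratified test set with the same 6-to-2 composition produces these numbers, regardless of which rows it contains.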

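The model gets better accuracy by always predicting class 0 than by "learning" some pattern or rule. This is a serious problem. To address it, you have the following options: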

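Try changing some of the hyperparameters of the random forest.
Collect more data to get a larger dataset; you are testing on only 8 samples (though new data may be hard for you to obtain).
Your data is imbalanced (more samples of class 0 than of class 1), so you should consider using SMOTE, for example from the imbalanced-learn library.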
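As a rough illustration of the oversampling idea behind the last option, the sketch below creates synthetic minority-class rows by interpolating between a minority row and its nearest minority neighbour. This is a simplified, pure-Python stand-in and not the imbalanced-learn implementation: real SMOTE picks among several nearest neighbours, and this version assumes exactly two classes. The function name `smote_like_oversample` and the toy data are hypothetical.

```python
import random


def smote_like_oversample(X, y, minority_label, seed=0):
    """Add synthetic minority rows until the two classes have equal counts.

    Each synthetic row lies on the segment between a randomly chosen
    minority row and that row's nearest minority neighbour.
    Assumes a binary problem and numeric features.
    """
    rng = random.Random(seed)
    minority = [row for row, label in zip(X, y) if label == minority_label]
    n_needed = (len(y) - len(minority)) - len(minority)
    if len(minority) < 2 or n_needed <= 0:
        # Nothing to interpolate between, or already balanced.
        return list(X), list(y)

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    X_out, y_out = list(X), list(y)
    for _ in range(n_needed):
        base = rng.choice(minority)
        others = [row for row in minority if row is not base]
        neighbour = min(others, key=lambda row: sq_dist(base, row))
        t = rng.random()  # position along the segment, in [0, 1)
        X_out.append([b + t * (n - b) for b, n in zip(base, neighbour)])
        y_out.append(minority_label)
    return X_out, y_out


# Toy training split (made-up numbers): 6 rows of class 0, 2 rows of class 1.
X_train = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2],
           [0.2, 0.4], [0.4, 0.1], [2.0, 2.1], [2.4, 1.9]]
y_train = [0] * 6 + [1] * 2

X_res, y_res = smote_like_oversample(X_train, y_train, minority_label=1)
print(y_res.count(0), y_res.count(1))
```

If you go this route, resample only the training portion of each split and leave the test portion untouched; otherwise synthetic points derived from test rows leak into training and the scores become overly optimistic. In practice you would probably reach for `SMOTE` from `imblearn.over_sampling` and its `fit_resample` method rather than a hand-written version like this one.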
