I am using sklearn to apply a decision tree with K-fold cross-validation. Can someone help me display its average score? Below is my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix,classification_report
dta=pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data")
X=dta.drop("whether he/she donated blood in March 2007",axis=1)
X=X.values # convert dataframe to numpy array
y=dta["whether he/she donated blood in March 2007"]
y=y.values # convert dataframe to numpy array
kf = KFold(n_splits=10)  # defaults: shuffle=False, random_state=None
clf_tree=DecisionTreeClassifier()
for train_index, test_index in kf.split(X):
    # select the rows belonging to this fold
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf=clf_tree.fit(X_train,y_train)
    # prints one report per fold; nothing is averaged yet
    print("classification_report_tree",
          classification_report(y_test,clf_tree.predict(X_test)))
If you only want the accuracy, you can simply use cross_val_score():
from sklearn.model_selection import cross_val_score
kf = KFold(n_splits=10)
clf_tree=DecisionTreeClassifier()
scores = cross_val_score(clf_tree, X, y, cv=kf)  # one accuracy value per fold
avg_score = np.mean(scores)
print(avg_score)
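Here, cross_val_score takes your original X and y as input (not split into train and test). It automatically splits them into train and test sets, fits the model on the training data, and scores it on the test data. Those scores are returned in the scores variable.
So with 10 folds, the scores variable holds 10 scores. You can then take their average.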
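To make the behavior described above concrete, here is a minimal sketch comparing cross_val_score with a hand-written loop over the same folds. It assumes a synthetic dataset from make_classification as a stand-in for the transfusion CSV, and it fixes random_state=0 on the tree (not in the original code) so that both runs are reproducible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in data instead of the transfusion CSV
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)

kf = KFold(n_splits=10)
# Assumption: random_state fixed so repeated fits give the same tree
clf_tree = DecisionTreeClassifier(random_state=0)

# cross_val_score splits, fits, and scores internally
scores = cross_val_score(clf_tree, X, y, cv=kf)

# The same procedure written out by hand
manual_scores = []
for train_index, test_index in kf.split(X):
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_index], y[train_index])
    # score() on a classifier returns accuracy, the default metric here
    manual_scores.append(model.score(X[test_index], y[test_index]))

print(scores.shape)                        # one score per fold
print(np.allclose(scores, manual_scores))  # both approaches should agree
print(scores.mean())                       # the average accuracy
```

If the two sets of scores agree, cross_val_score is just a shortcut for the loop, so it is usually the simpler choice when accuracy is all you need.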
You can try the precision_recall_fscore_support metric from sklearn and then average the results of each fold per class. I am assuming here that you need the scores averaged per class.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import GridSearchCV,cross_val_score
dta=pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data")
X=dta.drop("whether he/she donated blood in March 2007",axis=1)
X=X.values # convert dataframe to numpy array
y=dta["whether he/she donated blood in March 2007"]
y=y.values # convert dataframe to numpy array
kf = KFold(n_splits=10)  # defaults: shuffle=False, random_state=None
clf_tree=DecisionTreeClassifier()
score_array =[]
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf=clf_tree.fit(X_train,y_train)
    y_pred = clf.predict(X_test)
    # average=None returns per-class (precision, recall, fscore, support)
    score_array.append(precision_recall_fscore_support(y_test, y_pred, average=None))
# average over the folds: rows are metrics, columns are classes
avg_score = np.mean(score_array,axis=0)
print(avg_score)
#Output:
#[[ 0.77302466 0.30042282]
# [ 0.81755068 0.22192344]
# [ 0.79063779 0.24414489]
# [ 57. 17.8 ]]
Now, to get the precision of class 0, you can use avg_score[0][0]. Recall is in the second row (i.e. for class 0 it is avg_score[1][0]), while fscore and support are in the 3rd and 4th rows respectively.
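As a small illustration of the row layout described above, the sketch below averages two hand-made "folds" and unpacks the rows into named variables. The label lists are invented for the example and the name fold_results is hypothetical; they are not taken from the transfusion data.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Assumption: two tiny made-up folds of (y_true, y_pred) for illustration
fold_results = [
    precision_recall_fscore_support([0, 0, 1, 1], [0, 1, 1, 1], average=None),
    precision_recall_fscore_support([0, 0, 1, 1], [0, 0, 0, 1], average=None),
]

# Shape (4, 2): rows = precision, recall, fscore, support; columns = classes
avg_score = np.mean(fold_results, axis=0)

# Unpacking the rows avoids remembering which index is which metric
precision, recall, fscore, support = avg_score

print("class 0 precision:", precision[0])  # same value as avg_score[0][0]
print("class 0 recall:", recall[0])        # same value as avg_score[1][0]
print("class 1 fscore:", fscore[1])
print("average support per class:", support)
```

Unpacking into named rows is only a readability choice; indexing avg_score directly gives the same numbers.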