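I would like to output a clean DataFrame showing the model name, the parameters used in the model, and the resulting scoring metrics. It would be even better if there were a smarter way to iterate over the metric functions (given their different parameters). Example image of my target output.
What I have so far: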
def train_predict_score(clf, X_train, y_train, X_test, y_test):
    clf = clf.fit(X_train, y_train)
    y_pred_train = clf.predict(X_train)
    y_pred_test = clf.predict(X_test)
    result = []
    result.append(roc_auc_score(y_train, y_pred_train))
    result.append(roc_auc_score(y_test, y_pred_test))
    result.append(cohen_kappa_score(y_train, y_pred_train))
    result.append(cohen_kappa_score(y_test, y_pred_test))
    result.append(f1_score(y_train, y_pred_train, pos_label=1))
    result.append(f1_score(y_test, y_pred_test, pos_label=1))
    result.append(precision_score(y_train, y_pred_train, pos_label=1))
    result.append(precision_score(y_test, y_pred_test, pos_label=1))
    result.append(recall_score(y_train, y_pred_train, pos_label=1))
    result.append(recall_score(y_test, y_pred_test, pos_label=1))
    return result
# Initialize default models
clf1 = LogisticRegression(random_state=0)
clf2 = DecisionTreeClassifier(random_state=0)
clf3 = RandomForestClassifier(random_state=0)
clf4 = GradientBoostingClassifier(random_state=0)
results = []
# Build initial models
for clf in [clf1, clf2, clf3, clf4]:
    result = []
    result.append(clf)  # name and parameters - how can I show all info? it gets truncated
    result.append(train_predict_score(clf, X_train, y_train, X_test, y_test))  # how to parse this out into individual columns?
    results.append(result)
results = pd.DataFrame(results, columns=['clf', 'auc_train', 'auc_test', 'f1_train', 'f1_test', 'prec_train',
                                         'prec_test', 'recall_train', 'recall_test'])
results
Iterating over functions
Because functions are objects, you can put them in a list and simply iterate over it. For example:
def add1(x):
    return x + 1
def sub1(x):
    return x - 1
for func in [add1, sub1]:
    print(func(10))
produces
11
9
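The same idea extends to metrics that need different arguments. As a minimal sketch, assuming you want to fix extra keyword arguments ahead of time, `functools.partial` binds them so that every entry in the collection can be called the same way. The `precision` and `recall` functions below are simplified stand-ins written for this illustration, not the scikit-learn implementations.

```python
from functools import partial


def precision(y_true, y_pred, pos_label=1):
    """Toy stand-in: share of predicted positives that are truly positive."""
    predicted = [t for t, p in zip(y_true, y_pred) if p == pos_label]
    return sum(t == pos_label for t in predicted) / len(predicted)


def recall(y_true, y_pred, pos_label=1):
    """Toy stand-in: share of true positives that were predicted positive."""
    actual = [p for t, p in zip(y_true, y_pred) if t == pos_label]
    return sum(p == pos_label for p in actual) / len(actual)


# Map a column-friendly name to a metric with its arguments already bound.
metrics = {
    'precision_pos': partial(precision, pos_label=1),
    'precision_neg': partial(precision, pos_label=0),
    'recall_pos': partial(recall, pos_label=1),
}

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]

# Every metric is now called with the same two arguments.
scores = {name: metric(y_true, y_pred) for name, metric in metrics.items()}
print(scores)
```

Keying the dictionary by name also gives you the column labels for free, so there is no need to derive them from the function objects.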
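Getting the model name and parameters
As I understand it, you want to store the model's name (e.g. LogisticRegression) and its parameters in separate columns. First, you can get the parameters like this:
clf.get_params()
This returns all model parameters as a dictionary. To get the model name, you can take the string representation of the model and split it once on "(". The first element of the resulting list is the model's name. So
>>> clf
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
becomes
>>> str(clf).split('(', 1)[0]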
LogisticRegression
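Splitting the string representation works, but it depends on how the object happens to print. A sketch of an alternative is to read the class name directly with `type(clf).__name__`, which works for any Python object. To keep the snippet free of dependencies, `ToyClassifier` is a hypothetical stand-in that only mimics the `get_params()` method of a scikit-learn estimator.

```python
class ToyClassifier:
    """Hypothetical stand-in for a scikit-learn estimator."""

    def __init__(self, C=1.0, penalty='l2'):
        self.C = C
        self.penalty = penalty

    def get_params(self):
        # Mirrors the idea of get_params(): constructor arguments as a dict.
        return {'C': self.C, 'penalty': self.penalty}


clf = ToyClassifier(C=0.5)

# The class name comes from the type itself, not from the printed form.
model_name = type(clf).__name__
params = clf.get_params()
print(model_name, params)
```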
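Example
Here is a small example that should do what you want. It trains 3 different classifiers on sklearn's breast_cancer dataset and returns the roc_auc, f1, precision and recall scores on the training and test sets as a DataFrame: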
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
# load and split example dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
# classifiers with default parameters
clf1 = LogisticRegression()
clf2 = RandomForestClassifier()
clf3 = SVC()
clf_list = [clf1, clf2, clf3]
results_list = []
for clf in clf_list:
    clf.fit(X_train, y_train)
    res = {}
    # extract the model name from the object string
    res['Model'] = str(clf).split('(', 1)[0]
    # get parameters via get_params() method
    res['Parameters'] = clf.get_params()
    # for every metric, record performance on train and test set
    for metric_score in [roc_auc_score, f1_score, precision_score, recall_score]:
        # drop the '_score' suffix so the columns read roc_auc_train, f1_test, ...
        metric_name = metric_score.__name__.replace('_score', '')
        res[metric_name + '_train'] = metric_score(y_train, clf.predict(X_train))
        res[metric_name + '_test'] = metric_score(y_test, clf.predict(X_test))
    results_list.append(res)
results_df = pd.DataFrame(results_list)
The resulting DataFrame:
print(results_df.to_string())
                    Model                                         Parameters   f1_test  f1_train  precision_test  precision_train  recall_test  recall_train  roc_auc_test  roc_auc_train
0      LogisticRegression  {'fit_intercept': True, 'warm_start': False, '...  0.922384  0.969697        0.922384         0.966038     0.922384      0.973384      0.922384       0.959085
1  RandomForestClassifier  {'criterion': 'gini', 'warm_start': False, 'n_...  0.928137  0.998095        0.928137         1.000000     0.928137      0.996198      0.928137       0.998099
2                     SVC  {'decision_function_shape': None, 'verbose': F...  0.500000  1.000000        0.500000         1.000000     0.500000      1.000000      0.500000       1.000000
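Note: since you mentioned in your question that the DataFrame content gets truncated: this only happens for display purposes, e.g. when you print the DataFrame in the console as I did above. The content is still there when you access the corresponding cell directly.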
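To make that concrete, here is a small sketch with a hand-built frame; the row contents are made up for illustration and are not the results above. It reads one cell directly to recover the full dictionary and, as one possible way to get a column per parameter, builds a second frame from the list of dictionaries.

```python
import pandas as pd

# Illustrative rows only: two models with small, made-up parameter dicts.
rows = [
    {'Model': 'LogisticRegression', 'Parameters': {'C': 1.0, 'penalty': 'l2'}},
    {'Model': 'SVC', 'Parameters': {'C': 1.0, 'kernel': 'rbf'}},
]
df = pd.DataFrame(rows)

# Direct cell access returns the complete dict, however it is displayed.
full_params = df.loc[0, 'Parameters']
print(full_params)

# One column per parameter; parameters a model lacks become NaN.
params_df = pd.DataFrame(df['Parameters'].tolist())
wide_df = pd.concat([df[['Model']], params_df], axis=1)
print(wide_df)
```

Whether a single dictionary column or one column per parameter is more convenient depends on how different the models are; with very different estimators the wide form ends up mostly empty.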