Iterating over functions and outputting the results in an organized pandas DataFrame



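I would like to output a clean DataFrame that shows the model name, the parameters used in the model, and the resulting scoring metrics. It would be even better if there were a smarter way to iterate over the metric functions, given that they take different arguments. (The original post linked an example image of the target output.)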

This is what I have so far:

def train_predict_score(clf, X_train, y_train, X_test, y_test):
    clf = clf.fit(X_train, y_train)
    y_pred_train = clf.predict(X_train)
    y_pred_test = clf.predict(X_test)
    result = []
    result.append(roc_auc_score(y_train, y_pred_train))
    result.append(roc_auc_score(y_test, y_pred_test))
    result.append(cohen_kappa_score(y_train, y_pred_train))
    result.append(cohen_kappa_score(y_test, y_pred_test))
    result.append(f1_score(y_train, y_pred_train, pos_label=1))
    result.append(f1_score(y_test, y_pred_test, pos_label=1))
    result.append(precision_score(y_train, y_pred_train, pos_label=1))
    result.append(precision_score(y_test, y_pred_test, pos_label=1))
    result.append(recall_score(y_train, y_pred_train, pos_label=1))
    result.append(recall_score(y_test, y_pred_test, pos_label=1))
    return result
# Initialize default models
clf1 = LogisticRegression(random_state=0)
clf2 = DecisionTreeClassifier(random_state=0)
clf3 = RandomForestClassifier(random_state=0)
clf4 = GradientBoostingClassifier(random_state=0)
results = []
# Build initial models
for clf in [clf1, clf2, clf3, clf4]:
    result = []
    result.append(clf)  # name and parameters - how can I show all info? it gets truncated
    result.append(train_predict_score(clf, X_train, y_train, X_test, y_test))  # how to parse this out into individual columns?
    results.append(result)
results = pd.DataFrame(results, columns=['clf', 'auc_train', 'auc_test', 'f1_train', 'f1_test', 'prec_train',
                                         'prec_test', 'recall_train', 'recall_test'])
results

Iterating over functions

Since functions are objects, you can put them in a list and simply iterate over it. For example:

def add1(x):
    return x + 1
def sub1(x):
    return x - 1
for func in [add1, sub1]:
    print(func(10))

which produces

11
9
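The same idea extends to metric functions that take different arguments: each function can be paired with its own keyword arguments. The following is only a minimal sketch. The `accuracy` and `precision` helpers are hypothetical pure-Python stand-ins for functions from sklearn.metrics, and the `metrics` list layout is an assumption made for illustration.

```python
# Hypothetical stand-ins for metric functions; in practice these would be
# functions such as f1_score or precision_score from sklearn.metrics.
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred, pos_label=1):
    # Among samples predicted as pos_label, the fraction that truly are pos_label
    true_of_predicted_pos = [t for t, p in zip(y_true, y_pred) if p == pos_label]
    if not true_of_predicted_pos:
        return 0.0
    return sum(t == pos_label for t in true_of_predicted_pos) / len(true_of_predicted_pos)

# Each entry: (column name, function, extra keyword arguments for that function)
metrics = [
    ("accuracy", accuracy, {}),
    ("precision", precision, {"pos_label": 1}),
]

y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]

# Call every metric with the shared positional arguments plus its own kwargs
scores = {name: func(y_true, y_pred, **kwargs) for name, func, kwargs in metrics}
print(scores)
```

Keeping the keyword arguments next to each function means the loop body stays identical for every metric, whatever extra arguments a particular metric needs.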

Getting the model name and parameters

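As I understand it, you want to store the model's name (e.g. LogisticRegression) and its parameters in separate columns. First, you can get the parameters like this: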

clf.get_params()

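This returns all model parameters as a dictionary. To get the model name, you can take the model's string representation and split it once on "(". The first element of the resulting list is the model's name. So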

>>> clf
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

becomes

>>> str(clf).split('(', 1)[0]
'LogisticRegression'
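As a possible alternative to parsing the string representation, the class name can also be read from the object's type. The sketch below uses a hypothetical `ToyClassifier` class (not part of sklearn) so that it runs without any dependencies. Since sklearn estimators are ordinary Python classes, the same `type(clf).__name__` expression should apply to them as well.

```python
# Hypothetical stand-in for an estimator; it only mimics the
# "ClassName(param=value)" style of string representation.
class ToyClassifier:
    def __init__(self, C=1.0):
        self.C = C

    def __repr__(self):
        return f"ToyClassifier(C={self.C})"

clf = ToyClassifier()

# Approach from the text above: split the string representation once on "("
name_from_repr = str(clf).split('(', 1)[0]

# Alternative: ask the class for its name directly
name_from_type = type(clf).__name__

print(name_from_repr, name_from_type)
```

Reading the name from the type does not depend on how the object happens to print itself, which may matter if the representation format changes.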

Example

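Here is a small example that should do what you want. It trains three different classifiers on sklearn's breast_cancer dataset and returns the roc_auc, f1, precision, and recall scores on the training and test sets as a DataFrame: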

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
# load and split the example dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
# classifiers with default parameters
clf1 = LogisticRegression()
clf2 = RandomForestClassifier()
clf3 = SVC()
clf_list = [clf1, clf2, clf3]
results_list = []
for clf in clf_list:
    clf.fit(X_train, y_train)
    # predict once per split and reuse the predictions for every metric
    y_pred_train = clf.predict(X_train)
    y_pred_test = clf.predict(X_test)
    res = {}
    # extract the model name from the object's string representation
    res['Model'] = str(clf).split('(', 1)[0]
    # get the parameters via the get_params() method
    res['Parameters'] = clf.get_params()
    # for every metric, record the performance on the train and test set
    for metric_score in [roc_auc_score, f1_score, precision_score, recall_score]:
        # drop the '_score' suffix so columns read e.g. 'f1_train'
        metric_name = metric_score.__name__.replace('_score', '')
        res[metric_name + '_train'] = metric_score(y_train, y_pred_train)
        res[metric_name + '_test'] = metric_score(y_test, y_pred_test)
    # one row per classifier
    results_list.append(res)
results_df = pd.DataFrame(results_list)

The resulting DataFrame:

print(results_df.to_string())
                    Model                                         Parameters   f1_test  f1_train  precision_test  precision_train  recall_test  recall_train  roc_auc_test  roc_auc_train
0      LogisticRegression  {'fit_intercept': True, 'warm_start': False, '...  0.922384  0.969697        0.922384         0.966038     0.922384      0.973384      0.922384       0.959085
1  RandomForestClassifier  {'criterion': 'gini', 'warm_start': False, 'n_...  0.928137  0.998095        0.928137         1.000000     0.928137      0.996198      0.928137       0.998099
2                     SVC  {'decision_function_shape': None, 'verbose': F...  0.500000  1.000000        0.500000         1.000000     0.500000      1.000000      0.500000       1.000000

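Note: since you mentioned in your question that the DataFrame contents get truncated, this only affects display, for example when you print the DataFrame to the console as I did above. When you access the corresponding cell directly, the full content is still there.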
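To illustrate the note above, here is a small sketch of three ways to get at the full parameter information. The rows in `results_list` are hypothetical placeholders with made-up values, not output from the example; `pd.option_context` and `pd.json_normalize` are regular pandas functions (the latter assumes pandas 1.0 or newer).

```python
import pandas as pd

# Hypothetical rows shaped like the results_list built in the example;
# the values are placeholders, not real scores.
results_list = [
    {"Model": "LogisticRegression", "Parameters": {"C": 1.0, "penalty": "l2"}, "f1_test": 0.9},
    {"Model": "SVC", "Parameters": {"C": 1.0, "kernel": "rbf"}, "f1_test": 0.5},
]
results_df = pd.DataFrame(results_list)

# 1) Direct cell access returns the complete dict, regardless of display settings
params = results_df.loc[0, "Parameters"]

# 2) Temporarily lift the column-width limit when rendering the frame as text
with pd.option_context("display.max_colwidth", None):
    full_text = results_df.to_string()

# 3) Spread the nested dicts into one column per parameter
#    (columns are named like "Parameters.C"; missing parameters become NaN)
flat_df = pd.json_normalize(results_list)

print(params)
print(flat_df.columns.tolist())
```

The third option may be the closest to one column per parameter, at the cost of many sparse columns when the models have different parameter sets.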
