I'm using scikit-learn with Python 2.7 on Windows. What is wrong with my code for computing AUC? Thanks.
from sklearn.datasets import load_iris
from sklearn.cross_validation import cross_val_score
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)
iris = load_iris()
#print cross_val_score(clf, iris.data, iris.target, cv=10, scoring="precision")
#print cross_val_score(clf, iris.data, iris.target, cv=10, scoring="recall")
print cross_val_score(clf, iris.data, iris.target, cv=10, scoring="roc_auc")
Traceback (most recent call last):
File "C:/Users/foo/PycharmProjects/CodeExercise/decisionTree.py", line 8, in <module>
print cross_val_score(clf, iris.data, iris.target, cv=10, scoring="roc_auc")
File "C:\Python27\lib\site-packages\sklearn\cross_validation.py", line 1433, in cross_val_score
for train, test in cv)
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 800, in __call__
while self.dispatch_one_batch(iterator):
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 658, in dispatch_one_batch
self._dispatch(tasks)
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 566, in _dispatch
job = ImmediateComputeBatch(batch)
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 180, in __init__
self.results = batch()
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 72, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "C:\Python27\lib\site-packages\sklearn\cross_validation.py", line 1550, in _fit_and_score
test_score = _score(estimator, X_test, y_test, scorer)
File "C:\Python27\lib\site-packages\sklearn\cross_validation.py", line 1606, in _score
score = scorer(estimator, X_test, y_test)
File "C:\Python27\lib\site-packages\sklearn\metrics\scorer.py", line 159, in __call__
raise ValueError("{0} format is not supported".format(y_type))
ValueError: multiclass format is not supported
Edit 1: it looks like scikit-learn can decide thresholds even without any machine learning model. I wonder why:
import numpy as np
from sklearn.metrics import roc_curve
y = np.array([1, 1, 2, 2])
scores = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, thresholds = roc_curve(y, scores, pos_label=2)
print fpr
print tpr
print thresholds
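roc_auc in sklearn only supports binary classification, which is why the three-class iris target raises "multiclass format is not supported".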
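One way around this is to binarize your labels and extend your classification to a one-vs-all (one-vs-rest) scheme, computing a binary AUC for each class against the others. In sklearn you can use sklearn.preprocessing.LabelBinarizer for the binarization; see its documentation for details.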
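As a rough sketch of the one-vs-rest idea, the snippet below binarizes the iris labels and computes one binary AUC per class from out-of-fold probabilities. It assumes Python 3 and a scikit-learn release that has sklearn.model_selection (which replaced sklearn.cross_validation); the variable names are only illustrative. Because it pools the out-of-fold predictions before scoring, the numbers will not match a per-fold average exactly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.preprocessing import LabelBinarizer
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0)

# Out-of-fold class probabilities, shape (n_samples, n_classes).
cv = StratifiedKFold(n_splits=10)
proba = cross_val_predict(clf, iris.data, iris.target, cv=cv,
                          method="predict_proba")

# One indicator column per class: column k is 1 where target == class k.
y_bin = LabelBinarizer().fit_transform(iris.target)

# One binary (class k vs. the rest) AUC per column, then a macro average.
per_class_auc = [roc_auc_score(y_bin[:, k], proba[:, k])
                 for k in range(y_bin.shape[1])]
macro_auc = float(np.mean(per_class_auc))

print(per_class_auc)
print(macro_auc)
```

Newer scikit-learn releases (0.22 and later, to my knowledge) also accept scoring="roc_auc_ovr" in cross_val_score, which avoids the manual binarization.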
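Regarding the second part of your question, posted under 'Edit 1':
- The roc_curve function does not find an optimal threshold for prediction.
- roc_curve only generates the set of (fpr, tpr) pairs obtained by sweeping the threshold over the supplied scores (from 0 to 1 when they are probabilities), given y_true and y_prob (the score or probability of the positive class). No model is needed for that, only labels and scores.
- In general, a high roc_auc value suggests the classifier ranks positives above negatives well. When you use the classifier for prediction, however, you still need to choose the threshold that maximizes the metric you care about (F1 score, for example).
- On the ROC curve itself, a common choice of threshold is the point farthest from the diagonal (the fpr = tpr line).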
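To make the points above concrete, here is a minimal sketch using the toy data from 'Edit 1'. It picks one threshold by maximizing tpr - fpr (the vertical distance from the diagonal) and another by maximizing F1 over the distinct score values. It assumes Python 3; names such as best_roc_threshold and best_f1_threshold are made up for illustration, and the two criteria need not agree.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_curve

y = np.array([1, 1, 2, 2])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve only sweeps thresholds taken from the scores themselves.
fpr, tpr, thresholds = roc_curve(y, scores, pos_label=2)

# Distance above the diagonal at each threshold; argmax keeps the first tie.
distance = tpr - fpr
best_roc_threshold = thresholds[int(np.argmax(distance))]

# Alternative criterion: F1 of the hard predictions "score >= t".
y_true = (y == 2).astype(int)
candidates = np.unique(scores)
f1_values = [f1_score(y_true, (scores >= t).astype(int)) for t in candidates]
best_f1_threshold = float(candidates[int(np.argmax(f1_values))])

print(best_roc_threshold)
print(best_f1_threshold)
```

On a real problem the scores would come from predict_proba or decision_function on held-out data, and the threshold should be chosen on a validation set rather than the test set.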