cross_val_score default scoring is inconsistent

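According to the documentation,

for the scoring parameter of cross_val_score: if None, the estimator's default scorer (if available) is used.

For DecisionTreeRegressor, the default criterion is mse. So why am I getting different results here?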

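from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
# X, y are the asker's features and target; they are not shown in the question
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)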

dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26)
- cross_val_score(dt, X_train, y_train, cv=10, scoring='neg_mean_squared_error')
>>> array([ 46.94808341,  18.78121305,  18.19914701,  18.06935431,
17.19546733,  28.91247609,  39.41410887,  21.30453162,
31.96443414,  23.74191199])

cross_val_score(dt, X_train, y_train, cv=10)
>>> array([ 0.35723619,  0.75254466,  0.7181376 ,  0.65718608,  0.72531937,
0.4752839 ,  0.43169728,  0.63916363,  0.41406146,  0.68977882])

If I had to guess, the default scoring seems to be R2 rather than mse. Is my understanding of the default scorer correct, or is this a bug?

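The default scorer for DecisionTreeRegressor is the R2 score; you can find this in the DecisionTreeRegressor documentation. The criterion parameter (mse) only controls how splits are chosen while the tree is being fitted. It is not what the score method reports, and the score method is what cross_val_score falls back on when scoring is None.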

score(self, X, y, sample_weight=None)[source]
Return the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
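To make the definition above concrete, here is a minimal sketch that computes R^2 by hand as 1 - u/v and compares it with r2_score and with the estimator's own score method. The diabetes dataset and the random_state=0 seeds are my own assumptions for illustration; they are not from the question.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Assumed example data; the asker's X and y are not shown in the question.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26, random_state=0)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)

# R^2 = 1 - u/v, following the docstring quoted above.
u = ((y_test - y_pred) ** 2).sum()         # residual sum of squares
v = ((y_test - y_test.mean()) ** 2).sum()  # total sum of squares
r2_manual = 1 - u / v

r2_metric = r2_score(y_test, y_pred)   # the metric function
r2_default = dt.score(X_test, y_test)  # the estimator's default scorer

print(r2_manual, r2_metric, r2_default)
```

All three numbers should agree up to floating-point error, which is why the unscored cross_val_score call in the question returns R2-like values rather than MSE values.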

@PV8 is certainly right, but I would like to point out two details.

Detail #1: How do you use the R2 score as the scoring metric? Answer: make_scorer

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score, make_scorer
from sklearn.tree import DecisionTreeRegressor
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26)
print(cross_val_score(dt, X_train, y_train, cv=10, scoring=make_scorer(r2_score)))

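If you run this program several times, you will still get different results, because train_test_split shuffles the data differently on each run.

Detail #2: How do you get consistent results?

You need to set the random_state parameter (here, in train_test_split) to get the same results on every run.

For example: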

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score, make_scorer
from sklearn.tree import DecisionTreeRegressor
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26)
print(cross_val_score(dt, X_train, y_train, cv=10, scoring=make_scorer(r2_score)))

The results are now the same on every run.
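As a rough check of both details together, the sketch below repeats the experiment twice and also compares the three ways of asking for R2: the default (scoring=None), the built-in "r2" string, and make_scorer(r2_score). The helper name run_once, the seed value, the explicit KFold splitter, and the random_state on the tree are my own additions for illustration, not part of the code above.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.metrics import make_scorer, r2_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor


def run_once(seed=42):
    """Hypothetical helper: one full experiment with every source of randomness seeded."""
    X, y = load_diabetes(return_X_y=True)
    X_train, _, y_train, _ = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    # Seeding the tree as well pins down tie-breaking between equally good splits.
    dt = DecisionTreeRegressor(
        max_depth=4, min_samples_leaf=0.26, random_state=seed
    )
    # An explicit, seeded splitter so the folds are identical for every call.
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)

    default = cross_val_score(dt, X_train, y_train, cv=cv)
    named = cross_val_score(dt, X_train, y_train, cv=cv, scoring="r2")
    custom = cross_val_score(
        dt, X_train, y_train, cv=cv, scoring=make_scorer(r2_score)
    )
    return default, named, custom


first = run_once()
second = run_once()

print(np.array_equal(first[0], second[0]))  # same scores across runs
print(np.allclose(first[0], first[1]))      # default vs scoring="r2"
print(np.allclose(first[0], first[2]))      # default vs make_scorer(r2_score)
```

If you only need R2, passing scoring="r2" (or leaving scoring unset for a regressor) is shorter than building a scorer with make_scorer; make_scorer is more useful for metrics that have no built-in string name.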
