Why does calling fit reset the custom objective function in XGBClassifier?



I have tried to set up the XGBoost sklearn API XGBClassifier to use a custom objective function (brier), following this documentation:

.. note::  Custom objective function

    A custom objective function can be provided for the ``objective``
    parameter. In this case, it should have the signature
    ``objective(y_true, y_pred) -> grad, hess``:

    y_true: array_like of shape [n_samples]
        The target values
    y_pred: array_like of shape [n_samples]
        The predicted values
    grad: array_like of shape [n_samples]
        The value of the gradient for each sample point.
    hess: array_like of shape [n_samples]
        The value of the second derivative for each sample point

Here is my attempt:

import numpy as np
from xgboost import XGBClassifier
from sklearn.datasets import load_svmlight_file

train_data = load_svmlight_file('~/agaricus.txt.train')
X = train_data[0].toarray()
y = train_data[1]

def brier(y_true, y_pred):
    # xgboost passes raw margin scores, so squash them through the sigmoid first
    y_pred = 1.0 / (1.0 + np.exp(-y_pred))
    # gradient of (y_pred - y_true)**2 with respect to the raw score
    grad = 2 * y_pred * (y_true - y_pred) * (y_pred - 1)
    # second derivative with respect to the raw score
    hess = 2 * y_pred * (1 - y_pred) * (2 * y_pred * (y_true + 1) - y_true - 3 * y_pred ** 2)
    return grad, hess

m = XGBClassifier(objective=brier, seed=42)

It seems to produce the right object:

XGBClassifier(base_score=None, booster=None, colsample_bylevel=None,
              colsample_bynode=None, colsample_bytree=None, gamma=None,
              gpu_id=None, importance_type='gain', interaction_constraints=None,
              learning_rate=None, max_delta_step=None, max_depth=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              objective=<function brier at 0x7fe7ac418290>, random_state=None,
              reg_alpha=None, reg_lambda=None, scale_pos_weight=None, seed=42,
              subsample=None, tree_method=None, validate_parameters=False,
              verbosity=None)

However, calling the .fit method seems to reset the m object to its default settings:

m.fit(X, y)
m
XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints=None,
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='binary:logistic', random_state=42, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, seed=42, subsample=1,
              tree_method=None, validate_parameters=False, verbosity=None)

with objective='binary:logistic'. I noticed this while investigating why my Brier score is worse when optimizing directly for Brier than when using the default binary:logistic, as described in that post.
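For reference, such a comparison can be made with sklearn's brier_score_loss; a minimal sketch of the check on the fitted m from above:

from sklearn.metrics import brier_score_loss

# Brier score: mean squared difference between the predicted
# probability of the positive class and the 0/1 label (lower is better).
print(brier_score_loss(y, m.predict_proba(X)[:, 1]))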

So, how do I correctly set up XGBClassifier to use my brier function as the custom objective?

I believe you have mixed up objective with the objective function (obj as a parameter); the xgboost documentation can be confusing at times.

In short, you just need to fix this line:

m = XGBClassifier(obj=brier, seed=42)

Going deeper: the objective determines how xgboost optimizes for a given objective function. Normally, xgboost infers the objective from the number of classes in your y vector.
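You can watch that inference happen with a toy multiclass example (a minimal sketch; the three-class data is made up purely for illustration):

import numpy as np
from xgboost import XGBClassifier

rng = np.random.RandomState(42)
X_toy = rng.rand(60, 4)          # hypothetical features
y_toy = rng.randint(0, 3, 60)    # three classes present in y

m_toy = XGBClassifier().fit(X_toy, y_toy)
# The wrapper counted the classes and picked the objective itself;
# expected output: 'multi:softprob'
print(m_toy.objective)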

I pulled a snippet from the source code; as you can see, whenever you have only two classes the objective is set to binary:logistic:

class XGBClassifier(XGBModel, XGBClassifierBase):
    def __init__(self, objective="binary:logistic", **kwargs):
        super().__init__(objective=objective, **kwargs)

    def fit(self, X, y, sample_weight=None, base_margin=None,
            eval_set=None, eval_metric=None,
            early_stopping_rounds=None, verbose=True, xgb_model=None,
            sample_weight_eval_set=None, callbacks=None):

        evals_result = {}
        self.classes_ = np.unique(y)
        self.n_classes_ = len(self.classes_)

        xgb_options = self.get_xgb_params()  # <-- obj function is set here

        if callable(self.objective):
            obj = _objective_decorator(self.objective)  # <----- here is the name mismatch: if you pass your brier func as objective, the objective string becomes "binary:logistic"
            xgb_options["objective"] = "binary:logistic"
        else:
            obj = None

        if self.n_classes_ > 2:
            xgb_options['objective'] = 'multi:softprob'  # <----- objective is set here if n_classes > 2
            xgb_options['num_class'] = self.n_classes_

        +--  35 lines: feval = eval_metric if callable(eval_metric) else None-----

        self._Booster = train(xgb_options, train_dmatrix,  # <----- objective is passed in the xgb_options dictionary
                              self.get_num_boosting_rounds(),
                              evals=evals,
                              early_stopping_rounds=early_stopping_rounds,
                              evals_result=evals_result, obj=obj, feval=feval,  # <----- the obj function is passed to the lower-level API here
                              verbose_eval=verbose, xgb_model=xgb_model,
                              callbacks=callbacks)

        +-- 12 lines: self.objective = xgb_options["objective"]-----

        return self
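For context, _objective_decorator is the piece that bridges the two conventions. Roughly (paraphrasing the same source file), it adapts your sklearn-style objective(y_true, y_pred) into the (preds, dmatrix) signature that the lower-level train expects:

def _objective_decorator(func):
    # Wrap an sklearn-style objective f(y_true, y_pred) -> (grad, hess)
    # into the low-level convention inner(preds, dmatrix) -> (grad, hess).
    def inner(preds, dmatrix):
        labels = dmatrix.get_label()
        return func(labels, preds)
    return inner

So your brier function does reach the booster; only the objective string shown in the repr is overwritten with "binary:logistic".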

There is a fixed list of objectives that you can set:

objective [default=reg:squarederror]

reg:squarederror: regression with squared loss.
reg:squaredlogerror: regression with squared log loss ½[log(pred + 1) − log(label + 1)]². All input labels are required to be greater than -1. Also, see metric rmsle for possible issue with this objective.
reg:logistic: logistic regression
binary:logistic: logistic regression for binary classification, output probability
binary:logitraw: logistic regression for binary classification, output score before logistic transformation
binary:hinge: hinge loss for binary classification. This makes predictions of 0 or 1, rather than producing probabilities.
count:poisson: poisson regression for count data, output mean of poisson distribution
    max_delta_step is set to 0.7 by default in poisson regression (used to safeguard optimization)
survival:cox: Cox regression for right censored survival time data (negative values are considered right censored). Note that predictions are returned on the hazard ratio scale (i.e., as HR = exp(marginal_prediction) in the proportional hazard function h(t) = h0(t) * HR).
multi:softmax: set XGBoost to do multiclass classification using the softmax objective, you also need to set num_class(number of classes)
multi:softprob: same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata * nclass matrix. The result contains predicted probability of each data point belonging to each class.
rank:pairwise: Use LambdaMART to perform pairwise ranking where the pairwise loss is minimized
rank:ndcg: Use LambdaMART to perform list-wise ranking where Normalized Discounted Cumulative Gain (NDCG) is maximized
rank:map: Use LambdaMART to perform list-wise ranking where Mean Average Precision (MAP) is maximized
reg:gamma: gamma regression with log-link. Output is a mean of gamma distribution. It might be useful, e.g., for modeling insurance claims severity, or for any outcome that might be gamma-distributed.
reg:tweedie: Tweedie regression with log-link. It might be useful, e.g., for modeling total loss in insurance, or for any outcome that might be Tweedie-distributed.
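For a two-class problem, any of these can be passed to the wrapper as a plain string and survives fit() unchanged; only callables are intercepted (see the snippet above). For example:

from xgboost import XGBClassifier

# String objectives are forwarded to the booster as-is;
# binary:logitraw is chosen here purely as an example.
m = XGBClassifier(objective='binary:logitraw', seed=42)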

Just to confirm that the objective cannot be your brier function, manually setting the objective to the brier function in the source code, right before the lower-level API is called:

class XGBClassifier(XGBModel, XGBClassifierBase):
    def __init__(self, objective="binary:logistic", **kwargs):
        super().__init__(objective=objective, **kwargs)

    def fit(self, X, y, sample_weight=None, base_margin=None,
            eval_set=None, eval_metric=None,
            early_stopping_rounds=None, verbose=True, xgb_model=None,
            sample_weight_eval_set=None, callbacks=None):

        +-- 54 lines: evals_result = {}-----

        xgb_options["objective"] = xgb_options["obj"]
        self._Booster = train(xgb_options, train_dmatrix,
                              self.get_num_boosting_rounds(),
                              evals=evals,
                              early_stopping_rounds=early_stopping_rounds,
                              evals_result=evals_result, obj=obj, feval=feval,
                              verbose_eval=verbose, xgb_model=xgb_model,
                              callbacks=callbacks)

        +-- 14 lines: self.objective = xgb_options["objective"]-----

throws this error:

raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [10:09:53] /private/var/folders/z5/mchb9bz51cx3h97nkw9v0wkr0000gn/T/pip-install-kh801rm0/xgboost/xgboost/src/objective/objective.cc:26: Unknown objective function: `<function brier at 0x10b630d08>`
Objective candidate: binary:hinge
Objective candidate: multi:softmax
Objective candidate: multi:softprob
Objective candidate: rank:pairwise
Objective candidate: rank:ndcg
Objective candidate: rank:map
Objective candidate: reg:squarederror
Objective candidate: reg:squaredlogerror
Objective candidate: reg:logistic
Objective candidate: binary:logistic
Objective candidate: binary:logitraw
Objective candidate: reg:linear
Objective candidate: count:poisson
Objective candidate: survival:cox
Objective candidate: reg:gamma
Objective candidate: reg:tweedie
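In other words, the callable has to travel through the obj argument of the lower-level API, not through the objective string. A minimal sketch of that route, assuming the brier function and the X, y arrays from the question (brier_obj is just an illustrative wrapper; note the low-level convention is obj(preds, dtrain)):

import xgboost as xgb

dtrain = xgb.DMatrix(X, label=y)

def brier_obj(preds, dmatrix):
    # Low-level convention: obj(preds, dmatrix) -> (grad, hess);
    # reuse the sklearn-style brier(y_true, y_pred) from the question.
    return brier(dmatrix.get_label(), preds)

params = {'seed': 42}          # string objectives would go in this dict instead
booster = xgb.train(params, dtrain, num_boost_round=100, obj=brier_obj)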
