A question about Optuna's "n_trials"



I am trying to use Optuna to tune XGBoost hyperparameters, but because of memory constraints I cannot set the n_trials attribute too high, otherwise it raises a MemoryError. So I would like to know: if I set n_trials=5 and run the program 4 times, will the result be similar to setting n_trials=20 and running the program once?

Yes, if you use the same database to store the study across the different runs.
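
For example, here is a minimal sketch of resuming a study across separate runs; the study name, the SQLite file and the dummy objective below are assumptions for illustration, not part of the question:

# Minimal sketch: the same storage database lets 4 runs of n_trials=5 add up to 20 trials.
import optuna

def objective(trial):
    # dummy objective, stands in for the real XGBoost tuning function
    x = trial.suggest_float('x', -10, 10)
    return x ** 2

study = optuna.create_study(
    study_name='xgb_tuning',            # assumed name
    storage='sqlite:///optuna_xgb.db',  # same database file in every run
    direction='minimize',
    load_if_exists=True)                # resume the existing study instead of starting over
study.optimize(objective, n_trials=5)   # run the script 4 times -> 20 trials in total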

  • I have revised my earlier answer:

It is not the same; it is like playing a quarter of the game and then starting over.

XGBoost's fit() method has a parameter, xgb_model, for training XGBoost incrementally.

For example: basically, n_trials would stay at 20, and the dataset would be read in chunks.

The model has to be saved after fitting the first chunk. The second chunk then continues from that saved model, which is saved again for the next chunk if there are more, and so on.
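
As a bare-bones sketch of that save/reload/continue pattern on its own, assuming synthetic data chunks and an illustrative file name 'xgb_incremental.json' (neither comes from the original code):

# Minimal sketch of incremental fitting with the xgb_model parameter.
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(48)
# three synthetic (X, y) chunks standing in for chunks read from disk
chunks = [(rng.normal(size=(1000, 5)), rng.integers(0, 2, size=1000)) for _ in range(3)]

model_path = 'xgb_incremental.json'
for i, (X_chunk, y_chunk) in enumerate(chunks, start=1):
    model = XGBClassifier(n_estimators=100)
    if i == 1:
        model.fit(X_chunk, y_chunk)                        # first chunk: fit from scratch
    else:
        model.fit(X_chunk, y_chunk, xgb_model=model_path)  # continue from the saved booster
    model.save_model(model_path)                           # save for the next chunk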

It would also be worth checking for memory leaks, which can cause this kind of problem. Ideally, n_estimators should not be too high either; 1000 or below is fine, as higher values make training slower and use more memory. The same goes for max_depth, where I only use values between 6 and 13.

Below is a code snippet of an Optuna objective function that shows how the xgb_model parameter is used.

# Imports assumed by the snippet
import gc
from time import sleep

import optuna
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# tree_method, final_csv, savepath and clear_gpu are assumed to be defined
# elsewhere in the class/module this method belongs to.

# for tuning incrementally in chunks
def objective_chunk(self, trial, n_chunksize):
    nn_estimators = 500
    # stop early after 10% of n_estimators rounds without improvement
    nn_early_stopping_rounds = int(nn_estimators * 0.1)
    param = {
        # tree_method would ideally be gpu_hist for faster speed
        'tree_method': trial.suggest_categorical('tree_method', [tree_method]),
        # L2 regularization weight; increasing this value makes the model more conservative
        'lambda': trial.suggest_loguniform('lambda', 1e-3, 10.0),
        # L1 regularization weight; increasing this value makes the model more conservative
        'alpha': trial.suggest_loguniform('alpha', 1e-3, 10.0),
        # min loss reduction for a further partition on a leaf node; larger is more conservative
        'gamma': trial.suggest_categorical('gamma', [0, 3, 6]),
        # column sampling ratio per tree
        'colsample_bytree': trial.suggest_categorical('colsample_bytree',
                                                      [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]),
        # row sampling ratio for the training data
        'subsample': trial.suggest_categorical('subsample', [0.4, 0.5, 0.6, 0.7, 0.8, 1.0]),
        'learning_rate': trial.suggest_categorical('learning_rate',
                                                   [0.008, 0.009, 0.01, 0.012,
                                                    0.014, 0.016, 0.018, 0.02, 0.05]),
        'n_estimators': trial.suggest_categorical('n_estimators', [nn_estimators]),
        # maximum depth of the tree, signifies complexity of the tree
        'max_depth': trial.suggest_categorical('max_depth', [6, 9, 11, 13]),
        'random_state': trial.suggest_categorical('random_state', [48]),
        # minimum child weight; the larger the value, the more conservative the tree
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10)
    }

    model_xgbc = XGBClassifier(**param, use_label_encoder=False)

    # Fit the model chunk by chunk
    for i, X in enumerate(pd.read_csv(final_csv, chunksize=n_chunksize), start=1):
        y = X.pop('target')
        X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                              train_size=0.7, random_state=48)
        X, y = None, None
        gc.collect()

        if i == 1:
            print(f'Running Trial {trial.number} Chunk: {i}', end=' | ')
            model_xgbc.fit(X_train, y_train, eval_set=[(X_valid, y_valid)],
                           verbose=False, eval_metric=['logloss'],
                           early_stopping_rounds=nn_early_stopping_rounds)
        else:
            print(f'{i}', end=' | ')
            model_xgbc = XGBClassifier(use_label_encoder=False)
            model_xgbc.load_model(f'{savepath}model_xgbc.json')
            model_xgbc.fit(X_train, y_train, eval_set=[(X_valid, y_valid)],
                           verbose=False, eval_metric=['logloss'],
                           early_stopping_rounds=nn_early_stopping_rounds,
                           xgb_model=f'{savepath}model_xgbc.json')
        # Auxiliary attributes of the Python Booster object (such as feature_names)
        # are not saved in the binary format; save to JSON to keep them.
        model_xgbc.save_model(f'{savepath}model_xgbc.json')
        preds = model_xgbc.predict(X_valid)

        rmse = metrics.mean_squared_error(y_valid, preds, squared=False)
        trial.report(rmse, i)

        # Free memory before continuing or pruning the trial
        if trial.should_prune():
            del param, model_xgbc, preds
            X_train, y_train = None, None
            X_valid, y_valid = None, None
            gc.collect()
            sleep(3)
            raise optuna.TrialPruned()
        else:
            del model_xgbc
            X_train, y_train = None, None
            X_valid, y_valid = None, None
            gc.collect()
            sleep(3)
            clear_gpu()   # helper defined elsewhere to release GPU memory

    del param, preds
    X_train, y_train = None, None
    X_valid, y_valid = None, None
    gc.collect()
    sleep(3)

    return rmse

This objective is invoked by the code below; it is just an example:

nn_trials = 20
nn_chunksize = 10000        # number of rows per chunk
study.optimize(lambda trial: otb.objective_chunk(trial, nn_chunksize),
               n_trials=nn_trials,
               gc_after_trial=True)
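
For completeness, a minimal sketch of how the study itself could be created before this call; direction='minimize' matches the RMSE returned by the objective, and the MedianPruner is only an assumption, since any Optuna pruner works with the trial.report()/trial.should_prune() calls in the objective (the storage arguments from the first answer can also be added here to make runs resumable):

# Minimal sketch of creating the study used by the optimize() call above.
import optuna

study = optuna.create_study(
    direction='minimize',                                   # the objective returns an RMSE
    pruner=optuna.pruners.MedianPruner(n_warmup_steps=1))   # assumed pruner choice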
