Tuning random forest hyperparameters with panel data in Python



Questions:

  1. How do I tune the hyperparameters of a random forest with panel data in Python?
  2. Are there packages and functions that already implement this?

I looked for answers in the following places:

  1. https://towardsdatascience.com/time-series-nested-cross-validation-76adba623eb9
  2. https://stats.stackexchange.com/questions/326228/cross-validation-with-time-series
  3. https://stats.stackexchange.com/questions/369397/correct-cross-validation-procedure-for-single-model-applied-to-panel-data

All of them led me to the current state of my code.

The problem:

I am trying to forecast the weekly sales volume of each product. I have about 5000 products (grouped into categories) and 1.5 years of history. With so many products, building a separate model for each product does not seem to make sense, so instead one big model covers a whole product category at once.

I understand the concepts of time-aware cross-validation and nested cross-validation, but I lack the skills to implement them efficiently.

Sample data:

import pandas as pd
import numpy as np
from random import seed
from random import randint

seed(1)
Panel_data = pd.DataFrame({
    'Product': ["A", "B"] * 10,
    'Time': [ele for ele in range(1, 11) for i in range(2)],
    'Z': [randint(0, 10) for ele in range(1, 21)],
    'X': [randint(0, 10) for ele in range(1, 21)]})
Panel_data['Y'] = Panel_data['X'] + [randint(0, 10) for ele in range(1, 21)]

My current nested sliding-window CV approach (as described in the links):

I created a loop that lets the model learn from the data in the correct temporal order, and then, using RandomizedSearchCV from the sklearn.model_selection package, I find the best hyperparameters on the given subset. After iterating over all time ids in the data, I select the median hyperparameters.

This approach is very time-consuming! I would like to know whether it is the right approach and whether there is a better one.

from sklearn.metrics import mean_squared_error
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from statistics import mode

rf_baseline = RandomForestRegressor(n_estimators=2, random_state=42)
OLS_baseline = linear_model.LinearRegression()
random_grid = {'n_estimators': [100, 500],
               'max_features': ['auto', 'sqrt'],
               'max_depth': [3, None],
               'min_samples_split': [3, 10],
               'min_samples_leaf': [3, 10]}
rf_MSE = list()
n_estimators = list()
min_samples_split = list()
min_samples_leaf = list()
max_features = list()
max_depth = list()
for i in range(1, 10):
    print(i)
    # Train on time i, test on time i + 1 (sliding window of width 1).
    X_train = Panel_data.loc[Panel_data['Time'] == i, ['X', 'Z']]
    Y_train = Panel_data.loc[Panel_data['Time'] == i, 'Y']
    X_test = Panel_data.loc[Panel_data['Time'] == i + 1, ['X', 'Z']]
    Y_test = Panel_data.loc[Panel_data['Time'] == i + 1, 'Y']

    # Random forest baseline.
    rf_baseline.fit(X_train, Y_train)
    y_pred = rf_baseline.predict(X_test)
    mse = mean_squared_error(Y_test, y_pred)
    rf_MSE = rf_MSE + [mse]

    # Hyperparameters.
    rf_rs = RandomizedSearchCV(estimator=rf_baseline, param_distributions=random_grid,
                               n_iter=5, cv=2, verbose=2, random_state=42, n_jobs=-1)
    rf_rs.fit(X_train, Y_train)
    n_estimators = n_estimators + [rf_rs.best_params_.get('n_estimators')]
    min_samples_split = min_samples_split + [rf_rs.best_params_.get('min_samples_split')]
    min_samples_leaf = min_samples_leaf + [rf_rs.best_params_.get('min_samples_leaf')]
    max_features = max_features + [rf_rs.best_params_.get('max_features')]
    max_depth = max_depth + [rf_rs.best_params_.get('max_depth')]

# Baseline MSE across folds.
print(np.mean(rf_MSE))

# Selected hyperparameters (medians; cast to int where sklearn requires integers).
sel_n_estimators = int(np.median(n_estimators))
sel_min_samples_split = int(np.median(min_samples_split))
sel_min_samples_leaf = int(np.median(min_samples_leaf))
sel_max_features = mode(max_features)
med_max_depth = np.median([100 if v is None else v for v in max_depth])  # treat None as "very deep"
sel_max_depth = None if med_max_depth == 100 else int(med_max_depth)

rf_best = RandomForestRegressor(n_estimators=sel_n_estimators,
                                random_state=42,
                                min_samples_split=sel_min_samples_split,
                                min_samples_leaf=sel_min_samples_leaf,
                                max_features=sel_max_features,
                                max_depth=sel_max_depth,
                                bootstrap=True)

My expectations and hopes:

I was hoping that there is an already-implemented, easy-to-use function like RandomizedSearchCV that would work faster than the loop I implemented.

There is a fantastic package called optuna, which does hyperparameter tuning in a smart way.

In short: you specify a range for each hyperparameter, and optuna chooses the next set of hyperparameters to test based on the results of the previous sets, i.e. Bayesian optimization.

This video gives a good overview of how to use it (the example starts at about 5:00).
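
As a minimal sketch of the optuna API (a toy quadratic objective, nothing specific to the question's data):

import optuna

def toy_objective(trial):
    # optuna suggests a value for x; the TPE sampler concentrates
    # future suggestions around whatever has worked well so far.
    x = trial.suggest_float('x', -10, 10)
    return (x - 2) ** 2  # value to minimize

study = optuna.create_study(direction='minimize')
study.optimize(toy_objective, n_trials=20)
print(study.best_params)  # should be close to {'x': 2}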

1. Generating the data with a for loop

The loop determines how the train/test data is generated; this has nothing to do with RandomizedSearchCV. RandomizedSearchCV may give us good (lucky) or bad model parameters, and that is normal, because it just samples at random.

Below is a sample implementation that optimizes the parameters with optuna. The data is still generated by the loop. The important part is to create our objective function and return the mse as our objective value.

"""
Using optuna hyperparameter optimizer.
Ref: https://github.com/optuna/optuna
"""
import time
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
import optuna

Panel_data = pd.DataFrame({
'Product': ["A", "B"] * 10,
'Time': [ele for ele in range(1, 11) for i in range(2)],
'Z': [randint(0, 10) for ele in range(1, 21)],
'X': [randint(0, 10) for ele in range(1, 21)]})
Panel_data['Y'] = Panel_data['X'] + [randint(0, 10) for ele in range(1, 21)]

def objective(trial):
# Define model with init values from optuna.
rf_model = RandomForestRegressor(
n_estimators = trial.suggest_int('n_estimators', 100, 500),
min_samples_split = trial.suggest_int('min_samples_split', 3, 10),
min_samples_leaf = trial.suggest_int('min_samples_leaf', 3, 10),
max_features = trial.suggest_categorical("max_features", ["auto", "sqrt"]),
max_depth = trial.suggest_int('max_depth', 3, 10),
bootstrap = True,
random_state = 42
)

allmse = []

# Create datasets in CV scheme considering timeseries data.
for i in range(1, 10):
X_train = Panel_data.loc[Panel_data['Time'] == i, ['Z', 'X']]
Y_train = Panel_data.loc[Panel_data['Time'] == i, 'Y']
X_test = Panel_data.loc[Panel_data['Time'] == i + 1, ['Z', 'X']]
Y_test = Panel_data.loc[Panel_data['Time'] == i + 1, 'Y']

# Fit the train data.    
rf_model.fit(X_train, Y_train)

# Test the model with test data.        
y_pred = rf_model.predict(X_test)

# Save the mse.
mse = mean_squared_error(Y_test, y_pred)
allmse.append(mse)

return np.mean(allmse)  # Send mse as feedback to optuna sampler

def optuna_tune():
t0 = time.perf_counter()

num_trials = 30  # more is better especially if num param is high and param range is also high.
sampler = optuna.samplers.TPESampler(seed=1)  # TPE is optuna default sampler, others cmaes, skopt, etc

study = optuna.create_study(sampler=sampler, direction='minimize')
study.optimize(objective, n_trials=num_trials)

# Show the best params and mse value
best_params = study.best_params
print(f'best params: {study.best_params}')
print(f'best mean value: {study.best_value}')  

print(f'elapse: {time.perf_counter() - t0:0.1f}s')

# Start
optuna_tune()

Output:

...
[I 2021-12-08 13:29:24,229] Trial 29 finished with value: 44.890960282703766 and parameters: {'n_estimators': 272, 'min_samples_split': 3, 'min_samples_leaf': 9, 'max_features': 'auto', 'max_depth': 9}. Best is trial 3 with value: 44.515468624442995.
best params: {'n_estimators': 156, 'min_samples_split': 4, 'min_samples_leaf': 9, 'max_features': 'auto', 'max_depth': 8}
best mean value: 44.515468624442995
elapse: 74.8s
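
Since runtime is a concern in the question, optuna can also prune trials that already look bad partway through the folds, instead of running every fold for every trial. Below is a sketch of how the fold loop above could report intermediate values; trial.report, trial.should_prune, and MedianPruner are standard optuna API, and Panel_data plus the imports are reused from the block above:

def objective_with_pruning(trial):
    rf_model = RandomForestRegressor(
        n_estimators=trial.suggest_int('n_estimators', 100, 500),
        max_depth=trial.suggest_int('max_depth', 3, 10),
        random_state=42)
    allmse = []
    for step, i in enumerate(range(1, 10)):
        X_train = Panel_data.loc[Panel_data['Time'] == i, ['Z', 'X']]
        Y_train = Panel_data.loc[Panel_data['Time'] == i, 'Y']
        X_test = Panel_data.loc[Panel_data['Time'] == i + 1, ['Z', 'X']]
        Y_test = Panel_data.loc[Panel_data['Time'] == i + 1, 'Y']
        rf_model.fit(X_train, Y_train)
        allmse.append(mean_squared_error(Y_test, rf_model.predict(X_test)))
        # Report the running mean so the pruner can stop hopeless trials early.
        trial.report(np.mean(allmse), step)
        if trial.should_prune():
            raise optuna.TrialPruned()
    return np.mean(allmse)

study = optuna.create_study(direction='minimize',
                            pruner=optuna.pruners.MedianPruner(n_warmup_steps=3))
study.optimize(objective_with_pruning, n_trials=30)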

2. Time series split

Another way to prepare the time series data is via TimeSeriesSplit() from sklearn. At least with its defaults, it generates the data differently: it expands the training data while keeping the temporal order, see the example below.

Product  Time  Z   X   Y
0        A     1  2   4   9
1        B     1  2   9  12
2        A     2  5   3   3
3        B     2  2   4   5
4        A     3  2  10  11
5        B     3  9   2   3
6        A     4  8  10  18
7        B     4  4   0   6
8        A     5  3   5   7
9        B     5  8   1   8
10       A     6  6   6  12
11       B     6  7   6   8
12       A     7  7  10  17
13       B     7  8   8  15
14       A     8  8   2  10
15       B     8  4   9  18
16       A     9  2   7   7
17       B     9  8   7  16
18       A    10  9   8  11
19       B    10  8   7  16
X_train: [[2 4]]
Y_train: [9]
X_test: [[2 9]]
Y_test: [12]

Then in the next fold it takes 2 samples for training, and so on: it expands the training window.

X_train: [[2 4]
[2 9]]
Y_train: [ 9 12]
X_test: [[5 3]]
Y_test: [3]
...
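
As a side note, if you would rather have a sliding window of fixed size than an expanding one, TimeSeriesSplit also accepts a max_train_size argument that caps the training window. A small sketch on the same data (assuming Panel_data and numpy from above):

from sklearn.model_selection import TimeSeriesSplit
import numpy as np

# max_train_size=4 keeps only the 4 most recent samples in each training fold,
# turning the expanding window into a sliding one.
tscv = TimeSeriesSplit(n_splits=5, max_train_size=4)
X = np.array(Panel_data[['Z', 'X']])
for train_index, test_index in tscv.split(X):
    print(f'train: {train_index}, test: {test_index}')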

I used this expanding scheme in the optuna optimizer code below.

"""
Using optuna hyperparameter optimizer and sklearn TimeSeriesSplit
Ref: 
https://github.com/optuna/optuna
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html
"""
import time
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit
import optuna

Panel_data = pd.DataFrame({
'Product': ["A", "B"] * 10,
'Time': [ele for ele in range(1, 11) for i in range(2)],
'Z': [randint(0, 10) for ele in range(1, 21)],
'X': [randint(0, 10) for ele in range(1, 21)]})
Panel_data['Y'] = Panel_data['X'] + [randint(0, 10) for ele in range(1, 21)]
print(Panel_data.to_string())

def objective(trial):
# Define model with init values from optuna.
rf_model = RandomForestRegressor(
n_estimators = trial.suggest_int('n_estimators', 100, 500),
min_samples_split = trial.suggest_int('min_samples_split', 3, 10),
min_samples_leaf = trial.suggest_int('min_samples_leaf', 3, 10),
max_features = trial.suggest_categorical("max_features", ["auto", "sqrt"]),
max_depth = trial.suggest_int('max_depth', 3, 10),
bootstrap = True,
random_state = 42
)

allmse = []

tscv = TimeSeriesSplit(gap=0, max_train_size=None, n_splits=19, test_size=None)
X = np.array(Panel_data[['Z', 'X']])
y = np.array(Panel_data[['Y']])

# Create datasets in CV scheme considering timeseries data.
for train_index, test_index in tscv.split(X):
X_train, X_test = X[train_index], X[test_index]
Y_train, Y_test = y[train_index], y[test_index]

Y_train = Y_train.ravel()
Y_test = Y_test.ravel()

print(f'X_train: {X_train}')
print(f'Y_train: {Y_train}')
print(f'X_test: {X_test}')
print(f'Y_test: {Y_test}')

# Fit the train data.    
rf_model.fit(X_train, Y_train)

# Test the model with test data.        
y_pred = rf_model.predict(X_test)

# Save the mse.
mse = mean_squared_error(Y_test, y_pred)
allmse.append(mse)

return np.mean(allmse)  # Send mse as feedback to optuna sampler

def optuna_tune():
t0 = time.perf_counter()

num_trials = 20  # more is better especially if num param is high and param range is also high.
sampler = optuna.samplers.TPESampler(seed=1)  # TPE is optuna default sampler, others cmaes, skopt, etc

study = optuna.create_study(sampler=sampler, direction='minimize')
study.optimize(objective, n_trials=num_trials)

# Show the best params and mse value
best_params = study.best_params
print(f'best params: {study.best_params}')
print(f'best mean value: {study.best_value}')  

print(f'elapse: {time.perf_counter() - t0:0.1f}s')

# Start
optuna_tune()

Output:

[I 2021-12-08 15:06:44,324] A new study created in memory with name: no-name-20410ee9-790a-4ad1-8930-50baee3faefc
Product  Time  Z   X   Y
0        A     1  2   4   9
1        B     1  2   9  12
2        A     2  5   3   3
3        B     2  2   4   5
4        A     3  2  10  11
5        B     3  9   2   3
6        A     4  8  10  18
7        B     4  4   0   6
8        A     5  3   5   7
9        B     5  8   1   8
10       A     6  6   6  12
11       B     6  7   6   8
12       A     7  7  10  17
13       B     7  8   8  15
14       A     8  8   2  10
15       B     8  4   9  18
16       A     9  2   7   7
17       B     9  8   7  16
18       A    10  9   8  11
19       B    10  8   7  16
X_train: [[2 4]]
Y_train: [9]
X_test: [[2 9]]
Y_test: [12]
X_train: [[2 4]
[2 9]]
Y_train: [ 9 12]
X_test: [[5 3]]
Y_test: [3]
X_train: [[2 4]
[2 9]
[5 3]]
Y_train: [ 9 12  3]
X_test: [[2 4]]
Y_test: [5]
X_train: [[2 4]
[2 9]
[5 3]
[2 4]]
Y_train: [ 9 12  3  5]
X_test: [[ 2 10]]
Y_test: [11]
...
[I 2021-12-08 15:08:39,685] Trial 19 finished with value: 25.44589482974616 and parameters: {'n_estimators': 452, 'min_samples_split': 6, 'min_samples_leaf': 6, 'max_features': 'auto', 'max_depth': 10}. Best is trial 8 with value: 21.126358478438924.
best params: {'n_estimators': 215, 'min_samples_split': 4, 'min_samples_leaf': 3, 'max_features': 'auto', 'max_depth': 5}
best mean value: 21.126358478438924
elapse: 115.4s

Since the panel data is randomly generated, your results may differ.
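
Once the search is done, one natural way to use the result is to refit a final model on all the data with the best parameters. A sketch, assuming you kept a reference to the finished study (e.g. by having optuna_tune return it):

# Refit a final model with the parameters optuna found best.
rf_final = RandomForestRegressor(**study.best_params, bootstrap=True, random_state=42)
rf_final.fit(Panel_data[['Z', 'X']], Panel_data['Y'])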
