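I am working on a classification task that is essentially algorithm configuration: trying to choose the configuration (or "mode") that is likely to make an algorithm solve a problem in the shortest time.
I am learning to predict the "best" configuration from features of the problem instance. I see that scikit-learn lets you create your own scoring function to use when tuning a model. However, the scoring function receives only the true labels and the predicted labels as input.
Is it possible to determine which row of the dataset a prediction came from when it is passed to that custom scorer? That way I could work out the performance hit of the predicted ("wrong") configuration and score the model accordingly. Basically, a "wrong" choice can sometimes still be very good and close to optimal, but when the class labels are based purely on the best configuration, naive classification has no way of knowing this.
Here is a contrived example to illustrate what I am trying to do: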
import random as rnd
import pandas as pd

rnd.seed('hello')
probs = [f'instance_{i}' for i in range(6)]
confs = ('analytic', 'bruteforce', 'hybrid')
times = [(p, c, 60*rnd.random()) for p in probs for c in confs]
df_alltimes = pd.DataFrame(times, columns=('problem', 'config', 'time'))
print(df_alltimes)

bestrows = df_alltimes.groupby(['problem'])['time'].idxmin()
dataset = df_alltimes.loc[bestrows, ['config']].\
    rename(columns={'config': 'best_config'})
feats = [[rnd.random() for p in range(len(probs))] for f in range(5)]
for i in range(len(feats)):
    dataset[f'feature_{i}'] = feats[i]
print(dataset)
df_alltimes:
       problem      config       time
0   instance_0    analytic  15.307044
1   instance_0  bruteforce  36.742846
2   instance_0      hybrid  35.053416
3   instance_1    analytic  57.781358
4   instance_1  bruteforce  31.723275
5   instance_1      hybrid   8.080238
6   instance_2    analytic   4.211297
7   instance_2  bruteforce  24.034830
8   instance_2      hybrid  39.073023
9   instance_3    analytic  36.325485
10  instance_3  bruteforce  14.717841
11  instance_3      hybrid  57.103908
12  instance_4    analytic   7.358539
13  instance_4  bruteforce  10.805536
14  instance_4      hybrid   2.605044
15  instance_5    analytic   0.489870
16  instance_5  bruteforce  42.888858
17  instance_5      hybrid  58.634073

dataset:
    best_config  feature_0  feature_1  feature_2  feature_3  feature_4
0      analytic   0.645388   0.641626   0.975619   0.680713   0.209235
5        hybrid   0.993443   0.221038   0.893763   0.408532   0.254791
6      analytic   0.263872   0.142887   0.264538   0.166985   0.800054
10   bruteforce   0.155023   0.601300   0.258767   0.614732   0.850529
14       hybrid   0.766183   0.993692   0.597047   0.401482   0.275133
15     analytic   0.386327   0.065699   0.349115   0.370136   0.357329
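I am using sklearn with dataset, where X would be the feature columns and y would be the best_config column. In this example, the "bad" choices for instance_0 are almost equally bad, but for instance_1 the two wrong choices are not equally bad. So I would like my custom scorer to reflect this somehow. Is this possible?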
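To make the desired penalty concrete, here is a minimal sketch of the quantity I would like the scorer to use: the extra seconds spent by running a predicted configuration instead of the fastest one. The function name regret is hypothetical and only for illustration; the timings are copied from the df_alltimes table above for two of the instances.

```python
# Timings in seconds, copied from the df_alltimes table (two instances only).
times = {
    "instance_0": {"analytic": 15.307044, "bruteforce": 36.742846, "hybrid": 35.053416},
    "instance_1": {"analytic": 57.781358, "bruteforce": 31.723275, "hybrid": 8.080238},
}


def regret(problem, predicted_config):
    """Seconds lost by running predicted_config instead of the fastest config.

    Hypothetical helper, not part of scikit-learn.
    """
    runs = times[problem]
    return runs[predicted_config] - min(runs.values())


# Print the penalty for every possible choice on each instance.
for problem, runs in times.items():
    for config in runs:
        print(problem, config, round(regret(problem, config), 3))
```

For instance_0 the two wrong choices cost roughly the same, while for instance_1 picking analytic costs about twice as much as picking bruteforce. A plain match/miss score treats all four mistakes as equal.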
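In the end I did find a way to get the information I wanted in the original question. If you pass a pandas.Series as the target labels, its index attribute is available inside the scorer, so you can look up whatever you need in the full dataset.
In the solution below, the first part is essentially the same as the original minimal working example: it generates a dummy dataset.
In the second part, a custom scorer function is defined and then passed to the cross-validation hyperparameter tuner (RandomizedSearchCV in the code). Bear in mind that the data is garbage, so the "results" are meaningless; this is just a demonstration of how to refer to a more complete set of results, so that you can classify based on more specialised information than just "hit/miss".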
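A minimal sketch of that observation on its own, assuming reasonably recent scikit-learn and pandas. The toy data, the row_i index labels, the DummyClassifier, and the name index_aware_score are all just for illustration. It relies on scikit-learn slicing the Series for each fold without converting it to an array, which is what I observed rather than something I know to be a documented guarantee.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

# Toy data with a recognisable index so the labels are easy to spot.
idx = [f"row_{i}" for i in range(10)]
X = pd.DataFrame({"feature_0": np.arange(10.0)}, index=idx)
y = pd.Series([0, 1] * 5, index=idx, name="target")

seen = []  # index labels observed inside the scorer


def index_aware_score(y_true, y_pred):
    # y_true arrives as a slice of the original Series, so .index still
    # identifies the rows; y_pred is a plain array in the same order.
    seen.extend(y_true.index)
    return float(np.mean(y_true.to_numpy() == y_pred))


scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"),
    X,
    y,
    cv=5,
    scoring=make_scorer(index_aware_score),
)
print(sorted(seen))
```

Each row label should appear once across the five test folds, which means the scorer can use those labels to look up anything else keyed on the same index.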
import numpy as np
import pandas as pd
import random as rnd
INSTANCES = 200
FEATURES = 5
HP_ITER = 10
SEED = 1984
# invent timings for some problems run with different configurations
rnd.seed(SEED)
probs = [f'p_{i:03d}' for i in range(INSTANCES)]
confs = ('analytic', 'bruteforce', 'hybrid')
times = [(p,c,60*rnd.random()) for p in probs for c in confs]
df_times = pd.DataFrame(times, columns=('problem', 'config', 'time'))
# pick out the fastest config for each problem
bestrows = df_times.groupby(['problem'])['time'].idxmin()
dataset = (
    df_times.loc[bestrows, ['config', 'problem']]
    .rename(columns={'config': 'target'})
    .reset_index(drop=True)
)
# invent some features for each problem
feats = [[rnd.random() for _ in probs] for f in range(FEATURES) ]
for i in range(len(feats)):
    dataset[f'feature_{i}'] = feats[i]
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
# split our data into training and test sets
df_trn = dataset.sample(frac=0.8, replace=False, random_state=SEED)
df_tst = dataset.loc[~dataset.index.isin(df_trn.index)]

def _vb_loss(xvals, yvals, validation=False):
    """A custom scorer for cross-validation which uses distance to Virtual Best"""
    # xvals holds the true labels (a pandas.Series), yvals the predictions.
    # use the .index attribute to access the relevant rows in the
    # timing data frame
    source = df_tst if validation else df_trn
    data = source.loc[xvals.index].reindex(columns=['problem', 'target'])
    data['truevals'] = xvals
    data['predvals'] = yvals
    # what's the best time available for each problem?
    data = data.merge(
        df_times, left_on=['problem', 'truevals'], right_on=['problem', 'config']
    ).rename(columns={'time': 'best_time'}).drop(columns=['config'])
    # what's the time for our predicted choices?
    data = data.merge(
        df_times, left_on=['problem', 'predvals'], right_on=['problem', 'config']
    ).rename(columns={'time': 'pred_time'}).drop(columns=['config'])
    # how far away were the predictions in total?
    residual_seconds = np.sum(data['pred_time'] - data['best_time'])
    return residual_seconds

def fitAndPredict(use_custom_scorer=False):
    """Fit a model and make some predictions """
    our_scorer = make_scorer(_vb_loss, greater_is_better=False)
    hyperparameters = {'criterion': ['gini', 'entropy'],
                       'n_estimators': list(range(50, 250)),
                       'max_depth': list(range(2, 32))
                       }
    model = RandomizedSearchCV(
        RandomForestClassifier(random_state=SEED),
        hyperparameters,
        n_iter=HP_ITER,
        scoring=our_scorer if use_custom_scorer else None,
        verbose=1,
        random_state=SEED,
    )
    model.fit(
        df_trn.drop(columns=['target', 'problem']),
        df_trn['target']
    )
    preds = model.predict(df_tst.drop(columns=['target', 'problem']))
    return _vb_loss(df_tst['target'], preds, validation=True)

print("Timings for all configs:", df_times, "", sep="\n")
print("Labelled dataset:", dataset, "", sep="\n")
print("Test loss with default CV scorer :", fitAndPredict(False))
print("Test loss with custom CV scorer :", fitAndPredict(True))
Here is the output:
** Timings for all configs **
     problem      config       time
0      p_000    analytic  21.811701
1      p_000  bruteforce  29.652341
2      p_000      hybrid  20.376605
3      p_001    analytic  12.989269
4      p_001  bruteforce  51.759137
..       ...         ...        ...
595    p_198  bruteforce  10.874092
596    p_198      hybrid  14.723661
597    p_199    analytic  24.984775
598    p_199  bruteforce   4.899111
599    p_199      hybrid  36.188729
[600 rows x 3 columns]

** Labelled dataset **
         target  problem  feature_0  feature_1  feature_2  feature_3  feature_4
0        hybrid    p_000   0.864952   0.487293   0.946654   0.863503   0.310866
1      analytic    p_001   0.514093   0.007643   0.948784   0.582419   0.258159
2    bruteforce    p_002   0.319059   0.872320   0.321495   0.807644   0.158471
3      analytic    p_003   0.421063   0.955742   0.114808   0.980013   0.900057
4        hybrid    p_004   0.325935   0.125824   0.697967   0.037196   0.923626
..          ...      ...        ...        ...        ...        ...        ...
195      hybrid    p_195   0.179126   0.578338   0.391535   0.632501   0.442677
196  bruteforce    p_196   0.827637   0.641567   0.710201   0.833341   0.215357
197      hybrid    p_197   0.116661   0.480170   0.253893   0.623913   0.465419
198  bruteforce    p_198   0.670555   0.037084   0.954332   0.408546   0.935973
199  bruteforce    p_199   0.371541   0.463060   0.549176   0.581093   0.391114
[200 rows x 7 columns]

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Parallel(n_jobs=None)]: Done 50 out of 50 | elapsed: 8.8s finished
Test loss with default CV scorer : 542.5191014477357
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Parallel(n_jobs=None)]: Done 50 out of 50 | elapsed: 9.1s finished
Test loss with custom CV scorer : 522.3236277796698