Can I get extra information to a custom scorer function in sklearn?



I'm working on a classification task that is essentially algorithm configuration: trying to pick the configuration (or "mode") that will most likely let the problem-solving algorithm finish in the shortest time.

I'm learning to predict the "best" configuration based on features of the problem instances. I see that scikit-learn lets you create your own scoring function for tuning models. However, the score_func only receives the true labels and the predicted labels as input.
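For context, the standard pattern looks something like this (a minimal sketch; the function name accuracy_like is made up):

import numpy as np
from sklearn.metrics import make_scorer

def accuracy_like(y_true, y_pred):
    # only the two label arrays arrive here; nothing identifies which
    # dataset rows the predictions belong to
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

scorer = make_scorer(accuracy_like, greater_is_better=True)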

Is it possible to determine which row of the dataset a prediction came from when it is passed to this custom scorer? That way I could look up the performance hit of a predicted ("wrong") configuration and score the model accordingly. Basically, a "wrong" choice can sometimes still be very good and close to optimal, but naive classification has no way of knowing this when the class labels are based purely on the best configuration.

Here's a contrived example to illustrate what I'm trying to do:

import random as rnd
import pandas as pd

rnd.seed('hello')

# invent timings for some problems run with different configurations
probs = [f'instance_{i}' for i in range(6)]
confs = ('analytic', 'bruteforce', 'hybrid')
times = [(p, c, 60*rnd.random()) for p in probs for c in confs]
df_alltimes = pd.DataFrame(times, columns=('problem', 'config', 'time'))
print(df_alltimes)

# pick out the fastest config for each problem
bestrows = df_alltimes.groupby(['problem'])['time'].idxmin()
dataset = df_alltimes.loc[bestrows, ['config']].rename(columns={'config': 'best_config'})

# invent some features for each problem
feats = [[rnd.random() for p in range(len(probs))] for f in range(5)]
for i in range(len(feats)):
    dataset[f'feature_{i}'] = feats[i]
print(dataset)
df_alltimes:
problem      config       time
0   instance_0    analytic  15.307044
1   instance_0  bruteforce  36.742846
2   instance_0      hybrid  35.053416
3   instance_1    analytic  57.781358
4   instance_1  bruteforce  31.723275
5   instance_1      hybrid   8.080238
6   instance_2    analytic   4.211297
7   instance_2  bruteforce  24.034830
8   instance_2      hybrid  39.073023
9   instance_3    analytic  36.325485
10  instance_3  bruteforce  14.717841
11  instance_3      hybrid  57.103908
12  instance_4    analytic   7.358539
13  instance_4  bruteforce  10.805536
14  instance_4      hybrid   2.605044
15  instance_5    analytic   0.489870
16  instance_5  bruteforce  42.888858
17  instance_5      hybrid  58.634073
dataset:
best_config  feature_0  feature_1  feature_2  feature_3  feature_4
0     analytic   0.645388   0.641626   0.975619   0.680713   0.209235
5       hybrid   0.993443   0.221038   0.893763   0.408532   0.254791
6     analytic   0.263872   0.142887   0.264538   0.166985   0.800054
10  bruteforce   0.155023   0.601300   0.258767   0.614732   0.850529
14      hybrid   0.766183   0.993692   0.597047   0.401482   0.275133
15    analytic   0.386327   0.065699   0.349115   0.370136   0.357329

I'm using sklearn with dataset, where X is the feature columns and y is the best_config column. In this example, the "bad" choices for instance_0 are both almost equally bad, but for instance_1, the two wrong choices are not equally bad. So I'd like my custom scorer to reflect this somehow. Is that possible?
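Concretely, what I'd like the scorer to be able to compute is the extra time a predicted configuration costs over the fastest one. A sketch of that lookup against df_alltimes above (the regret helper is my own name, not anything from sklearn):

def regret(problem, predicted_config):
    # extra seconds incurred by the predicted config, relative to the
    # fastest config recorded for this problem
    rows = df_alltimes[df_alltimes['problem'] == problem]
    chosen = rows.loc[rows['config'] == predicted_config, 'time'].iloc[0]
    return chosen - rows['time'].min()

print(regret('instance_0', 'hybrid'))      # ~19.7s: both wrong choices are bad
print(regret('instance_1', 'bruteforce'))  # ~23.6s
print(regret('instance_1', 'analytic'))    # ~49.7s: far worse than bruteforce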

In the end I did find a way to get the information I wanted in my original question. If you pass a pandas.Series as the target labels, the index attribute is available, so you can look up whatever you need in the full dataset.
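A minimal sketch of the idea (hypothetical names; this relies on scikit-learn passing the labels through as a Series with its index intact, which it did in the version I used):

import pandas as pd

def inspect_scorer(y_true, y_pred):
    # y_true arrives as a pandas.Series slice, so its .index still
    # holds the row labels of the original dataset
    print(list(y_true.index))
    return 0.0

y = pd.Series(['analytic', 'hybrid', 'analytic'], index=[3, 17, 42])
inspect_scorer(y, ['analytic', 'analytic', 'hybrid'])  # prints [3, 17, 42]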

In the solution below, the first part is pretty much the same as the original minimal working example: generating a fake dataset.

In the second part, a custom scorer function is defined, which is then passed to the cross-validating hyperparameter tuner, RandomizedSearchCV. Please bear in mind the data is garbage, so the "results" are meaningless; this is just a demo of how to refer back to a fuller set of results, so that you can base classification on more sophisticated information than a pure "match/fail" label.

import numpy as np
import pandas as pd
import random as rnd

INSTANCES = 200
FEATURES  = 5
HP_ITER   = 10
SEED      = 1984

# invent timings for some problems run with different configurations
rnd.seed(SEED)
probs = [f'p_{i:03d}' for i in range(INSTANCES)]
confs = ('analytic', 'bruteforce', 'hybrid')
times = [(p, c, 60*rnd.random()) for p in probs for c in confs]
df_times = pd.DataFrame(times, columns=('problem', 'config', 'time'))

# pick out the fastest config for each problem
bestrows = df_times.groupby(['problem'])['time'].idxmin()
dataset = (df_times.loc[bestrows, ['config', 'problem']]
           .rename(columns={'config': 'target'})
           .reset_index(drop=True))

# invent some features for each problem
feats = [[rnd.random() for _ in probs] for f in range(FEATURES)]
for i in range(len(feats)):
    dataset[f'feature_{i}'] = feats[i]

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split

# split our data into training and test sets
df_trn = dataset.sample(frac=0.8, replace=False, random_state=SEED)
df_tst = dataset.loc[~dataset.index.isin(df_trn.index)]

def _vb_loss(xvals, yvals, validation=False):
    """A custom scorer for cross-validation which uses distance to Virtual Best"""
    # use the .index attribute to access the relevant rows in the
    # timing data frame
    source = df_tst if validation else df_trn
    data = source.loc[xvals.index].reindex(columns=['problem', 'target'])
    data['truevals'] = xvals
    data['predvals'] = yvals
    # what's the best time available for each problem?
    data = data.merge(
        df_times, left_on=['problem', 'truevals'], right_on=['problem', 'config']
    ).rename(columns={'time': 'best_time'}).drop(columns=['config'])
    # what's the time for our predicted choices?
    data = data.merge(
        df_times, left_on=['problem', 'predvals'], right_on=['problem', 'config']
    ).rename(columns={'time': 'pred_time'}).drop(columns=['config'])
    # how far away were the predictions in total?
    residual_seconds = np.sum(data['pred_time'] - data['best_time'])
    return residual_seconds

def fitAndPredict(use_custom_scorer=False):
    """Fit a model and make some predictions"""
    our_scorer = make_scorer(_vb_loss, greater_is_better=False)
    hyperparameters = {'criterion': ['gini', 'entropy'],
                       'n_estimators': list(range(50, 250)),
                       'max_depth': list(range(2, 32)),
                       }
    model = RandomizedSearchCV(
        RandomForestClassifier(random_state=SEED),
        hyperparameters,
        n_iter=HP_ITER,
        scoring=our_scorer if use_custom_scorer else None,
        verbose=1,
        random_state=SEED,
    )
    model.fit(
        df_trn.drop(columns=['target', 'problem']),
        df_trn['target']
    )
    preds = model.predict(df_tst.drop(columns=['target', 'problem']))
    return _vb_loss(df_tst['target'], preds, validation=True)

print("Timings for all configs:", df_times, "", sep="\n")
print("Labelled dataset:", dataset, "", sep="\n")
print("Test loss with default CV scorer :", fitAndPredict(False))
print("Test loss with custom CV scorer :", fitAndPredict(True))

Here's the output:

** Timings for all configs **
problem      config       time
0     p_000    analytic  21.811701
1     p_000  bruteforce  29.652341
2     p_000      hybrid  20.376605
3     p_001    analytic  12.989269
4     p_001  bruteforce  51.759137
..      ...         ...        ...
595   p_198  bruteforce  10.874092
596   p_198      hybrid  14.723661
597   p_199    analytic  24.984775
598   p_199  bruteforce   4.899111
599   p_199      hybrid  36.188729
[600 rows x 3 columns]
** Labelled dataset **
target problem  feature_0  feature_1  feature_2  feature_3  feature_4
0        hybrid   p_000   0.864952   0.487293   0.946654   0.863503   0.310866
1      analytic   p_001   0.514093   0.007643   0.948784   0.582419   0.258159
2    bruteforce   p_002   0.319059   0.872320   0.321495   0.807644   0.158471
3      analytic   p_003   0.421063   0.955742   0.114808   0.980013   0.900057
4        hybrid   p_004   0.325935   0.125824   0.697967   0.037196   0.923626
..          ...     ...        ...        ...        ...        ...        ...
195      hybrid   p_195   0.179126   0.578338   0.391535   0.632501   0.442677
196  bruteforce   p_196   0.827637   0.641567   0.710201   0.833341   0.215357
197      hybrid   p_197   0.116661   0.480170   0.253893   0.623913   0.465419
198  bruteforce   p_198   0.670555   0.037084   0.954332   0.408546   0.935973
199  bruteforce   p_199   0.371541   0.463060   0.549176   0.581093   0.391114
[200 rows x 7 columns]
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Parallel(n_jobs=None)]: Done  50 out of  50 | elapsed:    8.8s finished
Test loss with default CV scorer : 542.5191014477357
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Parallel(n_jobs=None)]: Done  50 out of  50 | elapsed:    9.1s finished
Test loss with custom CV scorer : 522.3236277796698
