scikit-learn: speeding up prediction

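An SVM model trained with default parameters on 5 features and 3000 samples takes unexpectedly long (more than an hour) to predict on 100000 samples with the same 5 features. Is there a way to speed up prediction?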

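A few things to consider here:

Have you standardized your input matrix X? SVM is not scale invariant, so the algorithm can struggle to classify when it is fed raw inputs with large values and no proper scaling.
The choice of the parameter C: a higher C allows a more complex, non-smooth decision boundary, and fitting that complexity takes more time. Lowering C from its default of 1 to a smaller value can therefore speed up the process.
It is also advisable to choose a proper gamma value. This can be done with grid-search cross-validation.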
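To make the scaling point concrete, here is a minimal sketch of what StandardScaler does to the inputs. The synthetic data and the factor of 1000 applied to one column are assumptions made purely for illustration; they are not from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the asker's data (5 features, 3000 samples).
X, y = make_classification(n_samples=3000, n_features=5, random_state=0)

# Exaggerate a scale mismatch by blowing up one column (illustrative only).
X_raw = X.copy()
X_raw[:, 0] *= 1000.0

# Learn per-feature mean and standard deviation, then rescale.
scaler = StandardScaler().fit(X_raw)
X_scaled = scaler.transform(X_raw)

print(X_raw.std(axis=0))     # the first column dominates
print(X_scaled.std(axis=0))  # every column is now close to 1
```

The same fitted scaler has to be applied to any data passed to predict, which is why wrapping the scaler and the SVC in one pipeline is convenient.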

Below is the code for running grid-search cross-validation. For simplicity, I ignore the test set here.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, recall_score, f1_score, roc_auc_score, make_scorer
# generate some artificial data
X, y = make_classification(n_samples=3000, n_features=5, weights=[0.1, 0.9])
# make a pipeline for convenience
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', class_weight='balanced'))
# set up parameter space, we want to tune SVC params C and gamma
# the range below is 10^(-5) to 1 for C and 0.01 to 100 for gamma
param_space = dict(svc__C=np.logspace(-5,0,5), svc__gamma=np.logspace(-2, 2, 10))
# choose your customized scoring function, popular choices are f1_score, accuracy_score, recall_score, roc_auc_score
my_scorer = make_scorer(roc_auc_score, greater_is_better=True)
# construct grid search
gscv = GridSearchCV(pipe, param_space, scoring=my_scorer)
gscv.fit(X, y)
# what's the best estimator
gscv.best_params_
Out[20]: {'svc__C': 1.0, 'svc__gamma': 0.21544346900318834}
# what's the best score, in our case, roc_auc_score
gscv.best_score_
Out[22]: 0.86819366014152421

Note: SVC is still not very fast. It takes more than 40 seconds to evaluate the 50 possible parameter combinations.

%time gscv.fit(X, y)
CPU times: user 42.6 s, sys: 959 ms, total: 43.6 s
Wall time: 43.6 s
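If the search itself is the bottleneck, one option is to spread the candidate fits over several CPU cores with GridSearchCV's n_jobs parameter. The sketch below is an assumption-laden variant of the code above, not the answerer's own: it uses a smaller synthetic dataset so it finishes quickly, and it swaps in the built-in 'roc_auc' scoring string, which ranks on decision values rather than on hard predicted labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Smaller synthetic dataset than in the answer, chosen only to keep the run short.
X, y = make_classification(n_samples=1000, n_features=5,
                           weights=[0.1, 0.9], random_state=0)

pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', class_weight='balanced'))

# Same 5 x 10 grid as in the answer: 50 candidate combinations.
param_space = dict(svc__C=np.logspace(-5, 0, 5),
                   svc__gamma=np.logspace(-2, 2, 10))

# n_jobs=-1 asks joblib to use all available cores for the candidate fits.
gscv = GridSearchCV(pipe, param_space, scoring='roc_auc', n_jobs=-1)
gscv.fit(X, y)

print(gscv.best_params_)
print(gscv.best_score_)
```

How much this helps depends on the number of cores available. It shortens the search only and does not make a single fit or predict call faster.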

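Because the number of features is relatively small, I would start by decreasing the penalty parameter. It controls the penalty for mislabeled samples in the training data, and since your data contains only 5 features, I guess it is not exactly linearly separable.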

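In general, lowering this parameter (C) lets the classifier accept a larger margin at the cost of some training accuracy (see the link in the original answer for more information).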

By default, C=1.0. Start with svm = SVC(C=0.1) and see how it goes.
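A quick way to check whether a lower C helps on your data is to time predict for both settings. The sketch below uses synthetic data with the sizes from the question (3000 training samples, 100000 samples to predict); the data, the variable names, and the split are assumptions for illustration. It also prints the number of support vectors, since the prediction cost of a kernel SVM grows with that number.

```python
import time
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# One synthetic dataset, split into 3000 training rows and 100000 rows to predict.
X, y = make_classification(n_samples=103000, n_features=5, random_state=0)
X_train, y_train = X[:3000], y[:3000]
X_new = X[3000:]

timings = {}
for C in (1.0, 0.1):
    model = make_pipeline(StandardScaler(), SVC(C=C))
    model.fit(X_train, y_train)

    # Time only the prediction step, which is what the question is about.
    start = time.perf_counter()
    pred = model.predict(X_new)
    timings[C] = time.perf_counter() - start

    n_sv = model.named_steps['svc'].n_support_.sum()
    print(f"C={C}: predict took {timings[C]:.2f}s with {n_sv} support vectors")
```

Which setting ends up faster depends on the data, so treat the measured timings, not the value of C by itself, as the deciding factor.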

One reason might be that the parameter gamma is not the same.

By default, sklearn.svm.SVC uses the RBF kernel with gamma set to 0.0, in which case 1/n_features is used instead. So gamma is different when the number of features is different.
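To take the default out of the picture, gamma can be set explicitly. As far as I know, recent scikit-learn releases no longer accept gamma=0.0: the 1/n_features behaviour is spelled gamma='auto', and the default has become gamma='scale'. The sketch below, on synthetic data that is only an assumption for illustration, checks that 'auto' and an explicit 1/n_features give the same model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
n_features = X.shape[1]

# gamma='auto' should resolve to 1 / n_features.
svc_auto = SVC(kernel='rbf', gamma='auto').fit(X, y)
# The same value, written out explicitly so it no longer depends on defaults.
svc_explicit = SVC(kernel='rbf', gamma=1.0 / n_features).fit(X, y)

# Identical gamma should yield identical decision values.
same = np.allclose(svc_auto.decision_function(X),
                   svc_explicit.decision_function(X))
print(same)
```

Passing a fixed number for gamma keeps the kernel width stable when comparing models across scikit-learn versions or feature counts.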

As for suggestions, I agree with Jianxun's answer.
