Using fit with sklearn GridSearchCV

I am new to sklearn and Python; I have a code snippet from a project that I am trying to decipher. I hope you can help me.

from repository import Repository
from configuration import config
repository = Repository(config)
dataset, labels = repository.get_dataset_and_labels()
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.svm import SVC
from sklearn.cross_validation import ShuffleSplit
from sklearn.grid_search import GridSearchCV  
# Ensure that there are no NaNs
dataset = dataset.fillna(-85)
# Split the dataset into training (90 %) and testing (10 %)
X_train, X_test, y_train, y_test = train_test_split(dataset, labels, test_size=0.1)
cv = ShuffleSplit(X_train.shape[0], n_iter=10, test_size=0.2, random_state=0)
# Define the classifier to use
estimator = SVC(kernel='linear')
# Define parameter space
gammas = np.logspace(-6, -1, 10)
# Use the training set with cross-validation to find the best hyper-parameters.
classifier = GridSearchCV(estimator=estimator, cv=cv, param_grid=dict(gamma=gammas))
classifier.fit(X_train, [repository.locations.keys().index(tuple(l))  for l in y_train])

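What I cannot understand is how the classifier's fit method is used. In every example I have found online, "fit" receives the training data and the corresponding labels. In the example above, "fit" receives the training data and the indexes of the labels (not the labels themselves). How can the classifier take indexes instead of labels and still work?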

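A label is just an abstract term. It can be anything: a word, a number, an index, anything. In your case (whatever repository.locations.keys().index(...) is, let's assume it is a deterministic function and call it f for simplicity), you create a list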

 [f(tuple(l)) for l in y_train]

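y_train is itself a list (or, more generally, an iterable). So the expression above is also a list of labels, just transformed through f for some reason (perhaps, in this particular case, the user needs a different set of labels from the one in the original dataset?). Either way, you are still passing labels to the fit method; they have simply been transformed.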

For example, consider the label set ['cat', 'dog']. It does not matter whether I train the model on [x1, x2, x3], ['cat', 'cat', 'dog'] or on [x1, x2, x3], [0, 0, 1] (label indexes).
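To illustrate the point, here is a minimal sketch that fits the same classifier once with string labels and once with their indexes. The toy feature matrix and the name label_names are made up for illustration and are not part of the original project.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, made-up feature matrix: two well-separated clusters.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [3.0, 3.1], [3.2, 2.9], [2.9, 3.0]])

label_names = ['cat', 'dog']
y_names = ['cat', 'cat', 'cat', 'dog', 'dog', 'dog']
# The same labels, expressed as positions in label_names.
y_index = [label_names.index(name) for name in y_names]

clf_names = SVC(kernel='linear').fit(X, y_names)
clf_index = SVC(kernel='linear').fit(X, y_index)

X_new = np.array([[0.1, 0.1], [3.1, 3.0]])
pred_names = clf_names.predict(X_new)
pred_index = clf_index.predict(X_new)

# Map the index predictions back to names to compare the two models.
decoded = [label_names[i] for i in pred_index]
print([str(p) for p in pred_names], decoded)
```

Both models should agree once the index predictions are decoded; the only difference is how the classes are named.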

Apparently, your label encoding happens here:

[repository.locations.keys().index(tuple(l))  for l in y_train]
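A rough sketch of what that expression does, assuming repository.locations is a dict keyed by tuples (the dict contents and the raw labels below are hypothetical). Note that calling .index() directly on keys() only works in Python 2, where keys() returns a list; in Python 3 the view has to be converted to a list first.

```python
# Hypothetical stand-in for repository.locations: a dict keyed by tuples.
locations = {(0, 0): 'hall', (0, 5): 'kitchen', (4, 5): 'office'}

# Each raw label is a sequence such as [0, 5]; tuple() turns it into a key.
y_train = [[0, 5], [0, 0], [4, 5], [0, 5]]

# Python 3: keys() is a view without .index(), hence the list() call.
keys = list(locations.keys())
encoded = [keys.index(tuple(l)) for l in y_train]
print(encoded)  # [1, 0, 2, 1]

# The mapping is reversible, so predictions can be decoded afterwards.
decoded = [keys[i] for i in encoded]
print(decoded)  # [(0, 5), (0, 0), (4, 5), (0, 5)]
```

Each distinct location tuple gets a stable integer, and that integer is what the classifier sees as the class label.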

Beyond that, I think it is worth taking a look at the GridSearchCV documentation.
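For reference, a sketch of the same search written against sklearn.model_selection, which replaced the sklearn.cross_validation and sklearn.grid_search modules in later scikit-learn releases. The synthetic data stands in for the repository's dataset, and the kernel is swapped to 'rbf' here as an assumption, since gamma has no effect on a linear kernel.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ShuffleSplit, train_test_split
from sklearn.svm import SVC

# Synthetic data standing in for the repository's dataset and labels.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out 10 % of the samples for a final check.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

# Ten random 80/20 splits of the training set for cross-validation.
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

# Candidate gamma values, same range as in the question.
param_grid = {'gamma': np.logspace(-6, -1, 10)}
search = GridSearchCV(estimator=SVC(kernel='rbf'), param_grid=param_grid, cv=cv)

# fit takes the training data and the labels (in any consistent encoding).
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))
```

After fit, best_params_ holds the gamma that scored best across the splits, and the fitted search object can be used like any other classifier.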
