sklearn GridSearchCV:ValueError:X每个样本有21个特征;预期19

我正试图在sklearn中运行GridSearchCV进行逻辑回归，但代码给了我以下错误：

ValueError: X has 21 features per sample; expecting 19

训练和测试数据的形状是

X_train.shape
(891L, 21L)  
X_test.shape
(418L, 21L)

我用来运行GridSearchCV的代码是

from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
logistic = LogisticRegression()
parameters = [{'C' : [1.0, 10.0, 100.0, 1000.0],
               'fit_intercept' : ['True', 'False'],
               'intercept_scaling' : [0, 1, 10, 100, 1000],
               'class_weight' : ['auto'],
               'random_state' : [26],
               'tol' : [0.001, 0.01, 0.1, 1, 10, 100]
               }]
logistic = GridSearchCV(LogisticRegression(),
                        parameters,
                        cv=3,
                        refit=True,
                        verbose=1)
logistic = logistic.fit(X_train, y_train)
logit_pred = logistic.predict(X_test)

我得到的回溯是：

ValueError                                Traceback (most recent call last)
C:Codekaggletitanictitanic.py in <module>()
    351 
    352 
--> 353     logistic = logistic.fit(X_train, y_train)
    354 
    355     logit_pred = logistic.predict(X_test)
C:UsersUserAppDataLocalEnthoughtCanopyUserlibsite-packagessklearngrid_search.pyc in fit(self, X, y)
    594 
    595         """
--> 596         return self._fit(X, y, ParameterGrid(self.param_grid))
    597 
    598 
C:UsersUserAppDataLocalEnthoughtCanopyUserlibsite-packagessklearngrid_search.pyc in _fit(self, X, y, parameter_iterable)
    376                                     train, test, self.verbose, parameters,
    377                                     self.fit_params, return_parameters=True)
--> 378             for parameters in parameter_iterable
    379             for train, test in cv)
    380 
C:UsersUserAppDataLocalEnthoughtCanopyUserlibsite-packagessklearnexternalsjoblibparallel.pyc in __call__(self, iterable)
    651             self._iterating = True
    652             for function, args, kwargs in iterable:
--> 653                 self.dispatch(function, args, kwargs)
    654 
    655             if pre_dispatch == "all" or n_jobs == 1:
C:UsersUserAppDataLocalEnthoughtCanopyUserlibsite-packagessklearnexternalsjoblibparallel.pyc in dispatch(self, func, args, kwargs)
    398         """
    399         if self._pool is None:
--> 400             job = ImmediateApply(func, args, kwargs)
    401             index = len(self._jobs)
    402             if not _verbosity_filter(index, self.verbose):
C:UsersUserAppDataLocalEnthoughtCanopyUserlibsite-packagessklearnexternalsjoblibparallel.pyc in __init__(self, func, args, kwargs)
    136         # Don't delay the application, to avoid keeping the input
    137         # arguments in memory
--> 138         self.results = func(*args, **kwargs)
    139 
    140     def get(self):
C:UsersUserAppDataLocalEnthoughtCanopyUserlibsite-packagessklearncross_validation.pyc in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters)
   1238     else:
   1239         estimator.fit(X_train, y_train, **fit_params)
-> 1240     test_score = _score(estimator, X_test, y_test, scorer)
   1241     if return_train_score:
   1242         train_score = _score(estimator, X_train, y_train, scorer)
C:UsersUserAppDataLocalEnthoughtCanopyUserlibsite-packagessklearncross_validation.pyc in _score(estimator, X_test, y_test, scorer)
   1294         score = scorer(estimator, X_test)
   1295     else:
-> 1296         score = scorer(estimator, X_test, y_test)
   1297     if not isinstance(score, numbers.Number):
   1298         raise ValueError("scoring must return a number, got %s (%s) instead."
C:UsersUserAppDataLocalEnthoughtCanopyUserlibsite-packagessklearnmetricsscorer.pyc in _passthrough_scorer(estimator, *args, **kwargs)
    174 def _passthrough_scorer(estimator, *args, **kwargs):
    175     """Function that wraps estimator.score"""
--> 176     return estimator.score(*args, **kwargs)
    177 
    178 
C:UsersUserAppDataLocalEnthoughtCanopyUserlibsite-packagessklearnbase.pyc in score(self, X, y, sample_weight)
    289         """
    290         from .metrics import accuracy_score
--> 291         return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
    292 
    293 
C:UsersUserAppDataLocalEnthoughtCanopyUserlibsite-packagessklearnlinear_modelbase.pyc in predict(self, X)
    213             Predicted class label per sample.
    214         """
--> 215         scores = self.decision_function(X)
    216         if len(scores.shape) == 1:
    217             indices = (scores > 0).astype(np.int)
C:UsersUserAppDataLocalEnthoughtCanopyUserlibsite-packagessklearnlinear_modelbase.pyc in decision_function(self, X)
    194         if X.shape[1] != n_features:
    195             raise ValueError("X has %d features per sample; expecting %d"
--> 196                              % (X.shape[1], n_features))
    197 
    198         scores = safe_sparse_dot(X, self.coef_.T,
ValueError: X has 21 features per sample; expecting 19

为什么GridSearchCV期望的功能数量与数据集包含的功能数量不同？

更新：

谢谢你的回复，安迪。数据集的类型均为numpy.ndarray，数据类型为float64。

type(X_Train)    type(y_train)    type(X_test)
numpy.ndarray    numpy.ndarray    numpy.ndarray

在我将它们引入sklearn:之前的步骤

train_data = traindf.values
test_data = testdf.values
X_train = train_data[0::, 1::]  # training features
y_train = train_data[0::, 0]    # training targets
X_test = test_data[0::, 0::]    # test features

下一步是我在上面键入的GridSearchCV代码。。。

更新2：数据链接

以下是数据集的链接

错误是由intercept_scaling=0引起的。看起来像是scikit学习中的一个bug。

相关内容

最新更新

热门标签：