Custom sklearn transformer in a Pipeline throws IndexError with cross_validate, but works fine with GridSearchCV



I created a custom transformer (TopQuantile(), shown below) using sklearn's TransformerMixin and BaseEstimator classes. It essentially just runs np.percentile() or pd.DataFrame.quantile() on numpy or pandas input features/columns, respectively, to determine which values in a feature fall within a user-specified quantile and which do not, then writes the per-row counts to a new numpy/pandas column.

The problem is that when I run the Pipeline with cross_validate, it throws IndexError: index 10 is out of bounds for axis 1 with size 10. I've looked and looked, and this doesn't seem to make any sense, since all of the calculations in my transformer's fit() method simply assume the same number of features/columns as the input X provides, and don't care at all how many rows there are (note that the IndexError is about axis = 1, the columns, not having the expected size).
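For reference on what the message itself means: axis 1 is the column axis, so this is the same error you would get by asking a 10-column array for column position 10. A minimal sketch with made-up shapes:

import numpy as np

X = np.zeros((5, 10))   #5 rows, 10 columns, so valid column positions are 0-9
X[:, [3, 10]]           #asking for column position 10 raises
                        #IndexError: index 10 is out of bounds for axis 1 with size 10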

Now for the strangest part: when I run the Pipeline through GridSearchCV, it runs perfectly fine and gives me the output I expect! Why would cross_validate throw such a basic error, suggesting my transformer is inherently flawed, when GridSearchCV works without a hitch?? Please help. Copies of my transformer, the Pipeline I'm using, and the GridSearchCV and cross_validate calls are all included below (note that I'm using Python 2.7, as required by the course this project is for):

The custom transformer:

import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator

class TopQuantile(BaseEstimator, TransformerMixin):
    '''
    Engineer a new feature using the top quantile values of a given set of features.
    For every value in those features, check to see if the value is within the top q-quantile
    of that feature. If so, increase the count for that sample by +1. The new feature is an
    integer count of how often each sample had a value in the top q-quantile of the specified
    features.

    This class's fit(), transform(), and fit_transform() methods all accept either a pandas
    DataFrame or a 2D numpy array as input.
    '''

    def __init__(self, new_feature_name = 'top_finance', feature_list = None, q = 0.90):
        '''
        Constructor for TopQuantile objects.

        Parameters
        ----------
        new_feature_name: str. Name of the feature that will be added as a pandas DataFrame
            column upon transformation. Only used if X is a DataFrame.
        feature_list: list of str or int.
            If X is a DataFrame: names of the feature columns that should be included in
            the count of top quantile membership.
            If X is a 2D numpy array: integer positions of the columns to be used.
        q: float. The quantile you want to count against. For example, q = 0.90 looks at
            the 90th percentile (top decile).
        '''
        self.new_feature_name = new_feature_name
        self.feature_list = feature_list
        self.q = q

    def fit(self, X, y = None):
        '''
        Calculates the q-quantile correctly both for features that are largely positive
        and for ones that are largely negative (which DataFrame.quantile() does not do
        on its own). For example, if most of a feature's data points are between (-1E5, 0),
        the "top decile" should not be -100, it should be -1E4.

        Parameters
        ----------
        X: features DataFrame or numpy array, one feature per column
        y: labels DataFrame/numpy array, ignored
        '''
        if isinstance(X, pd.DataFrame):
            #Is self.feature_list something other than a list of strings?
            if not isinstance(self.feature_list[0], str):
                raise TypeError('feature_list is not a list of strings')
            #Majority-negative features need to check df.quantile(1-q)
            #in order to use the correct quantile value
            pos = X.loc[:, self.feature_list].quantile(self.q)
            neg = X.loc[:, self.feature_list].quantile(1.0 - self.q)
            #Replace negative quantile values of neg within pos to create a
            #merged Series with proper quantile values for majority-positive
            #and majority-negative features
            pos.loc[neg < 0] = neg.loc[neg < 0]
            self.quants = pos
        #Are the features a numpy array?
        elif isinstance(X, np.ndarray):
            #Is self.feature_list something other than a list of int?
            if not isinstance(self.feature_list[0], int):
                raise TypeError('feature_list is not a list of integers')
            #Majority-negative features need to check the (1-q) percentile
            #in order to use the correct quantile value
            pos = np.percentile(X[:, self.feature_list], self.q * 100, axis = 0)
            neg = np.percentile(X[:, self.feature_list], (1.0 - self.q) * 100, axis = 0)
            #Replace negative quantile values of neg within pos to create a
            #merged array with proper quantile values for majority-positive
            #and majority-negative features
            pos[neg < 0] = neg[neg < 0]
            self.quants = pos
        else:
            raise TypeError('Features need to be either a pandas DataFrame or a numpy array')
        #Per sklearn convention, fit() returns self so calls can be chained
        return self

    def transform(self, X):
        '''
        Using the quantile information from fit(), adds a new feature to X containing integer
        counts of how many times a sample had a value in the top q-quantile of its feature,
        limited to only the features in self.feature_list.

        Parameters
        ----------
        X: features DataFrame or numpy array, one feature per column

        Returns
        ----------
        If X is a DataFrame: the input DataFrame with an additional column for the new
            feature, named self.new_feature_name
        If X is a 2D numpy array: same as the DataFrame case, except a numpy array with
            no column names
        '''
        #Mark every value in X as True or False depending on whether or not it falls
        #within the top q-quantile
        if isinstance(X, pd.DataFrame):
            self.boolean = X.loc[:, self.feature_list].abs() >= self.quants.abs()
            #Sum across each row to produce the counts
            X[self.new_feature_name] = self.boolean.sum(axis = 1)
        elif isinstance(X, np.ndarray):
            self.boolean = np.absolute(X[:, self.feature_list]) >= np.absolute(self.quants)
            #Append the counts as a new final column
            X = np.vstack((X.T, np.sum(self.boolean, axis = 1))).T
        else:
            raise TypeError('Features need to be either a pandas DataFrame or a numpy array')
        return X

    def fit_transform(self, X, y = None):
        '''
        Provides output identical to running fit() and then transform(), in one nice
        little package.

        Parameters
        ----------
        X: features DataFrame or 2D numpy array, one feature per column
        y: labels DataFrame, ignored
        '''
        self.fit(X, y)
        return self.transform(X)
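For a quick sanity check of what the transformer is meant to produce, here is a minimal sketch on a made-up DataFrame (column names and values invented purely for illustration):

import pandas as pd

toy = pd.DataFrame({'salary': [1.0, 2.0, 3.0, 100.0],
                    'bonus': [5.0, 6.0, 200.0, 7.0]})

tq = TopQuantile(feature_list = ['salary', 'bonus'], q = 0.75)

#Adds a 'top_finance' column that counts, per row, how many of that row's
#values fall within the top quartile of their respective features
print(tq.fit_transform(toy))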

The pipeline:

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import Imputer, RobustScaler
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import numpy as np

#Suppress the warnings coming from GridSearchCV to reduce output messages
import warnings
import sklearn.exceptions
warnings.filterwarnings("ignore", category=sklearn.exceptions.UndefinedMetricWarning)

features = df.drop(columns = ['poi'])
labels = df['poi']

#--------------------------------- CROSS-VALIDATION -----------------------------------------
#Shuffled and stratified cross-validation binning for this tuning exercise
cv_10 = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state = 42)

#--------------------------------- IMPUTATION -----------------------------------------
#Imputation using the median of each feature
imp = Imputer(missing_values='NaN', strategy='median')

#--------------------------------- FEATURE ENGINEERING -----------------------------------------
#Feature engineering with TopQuantile() to count the top-quantile financial features
feats = ['salary', 'total_payments', 'bonus', 'total_stock_value', 'expenses',
         'exercised_stock_options', 'other', 'restricted_stock']

#Since numpy needs the columns as integer positions instead of names...
feats_loc_list = []
for e in feats:
    feats_loc_list.append(features.columns.get_loc(e))

topQ = TopQuantile(feature_list = feats_loc_list)

#--------------------------------- FEATURE SCALING -----------------------------------------
#Feature scaling via RobustScaler()
scaler = RobustScaler()

#--------------------------------- FEATURE SELECTION -----------------------------------------
#Feature selection via SelectPercentile(f_classif, percentile = 75)
selector = SelectPercentile(score_func = f_classif, percentile = 75)

#--------------------------------- TUNING -----------------------------------------
#kNN classifier and its hyperparameter grid
knn = KNeighborsClassifier()
knn_param_grid = {'kNN__n_neighbors': range(1, 21, 1), 'kNN__weights': ['uniform', 'distance'],
                  'kNN__p': [1, 2]}

#Hyperparameter tuning
knn_pipe = Pipeline([('impute', imp), ('engineer', topQ), ('scale', scaler),
                     ('select', selector), ('kNN', knn)])
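One detail worth flagging, because it matters later: feats_loc_list holds column positions relative to the full features DataFrame, so those positions only stay valid while the data keeps all of its original columns. A hypothetical illustration (the real values depend on df's column order):

print(feats_loc_list)
#e.g. [0, 1, 3, 5, 7, 9, 10, 12] -- if 'restricted_stock' sits at position 12,
#that index only exists while every original column is still present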

The GridSearchCV call:

knn_gs = GridSearchCV(knn_pipe, knn_param_grid, scoring = ['precision', 'recall', 'f1'],
                      cv = cv_10, refit = 'f1', return_train_score = False)
knn_gs.fit(features, labels)

The cross_validate call:

knn_pipe_tuned = Pipeline([('impute', imp), ('engineer', topQ), ('scale', scaler),
                           ('select', selector), ('kNN', knn_gs.best_estimator_)])

cv_1000 = StratifiedShuffleSplit(n_splits=1000, test_size=0.2, random_state=42)

from sklearn.model_selection import cross_validate
knn_scores = cross_validate(knn_pipe_tuned, features, labels, groups=None,
                            scoring=['precision', 'recall', 'f1'], cv=cv_1000)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-147-4f04d5e63a0b> in <module>()
12 from sklearn.model_selection import cross_validate
13 knn_scores = cross_validate(knn_pipe_tuned, features, labels, groups=None, 
---> 14                             scoring=['precision', 'recall', 'f1'], cv=cv_1000)
15 
16 knn_cv_results = pd.DataFrame(knn_scores)
/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/model_selection/_validation.pyc in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score)
204             fit_params, return_train_score=return_train_score,
205             return_times=True)
--> 206         for train, test in cv.split(X, y, groups))
207 
208     if return_train_score:
/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
777             # was dispatched. In particular this covers the edge
778             # case of Parallel used with an exhausted iterator.
--> 779             while self.dispatch_one_batch(iterator):
780                 self._iterating = True
781             else:
/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in dispatch_one_batch(self, iterator)
623                 return False
624             else:
--> 625                 self._dispatch(tasks)
626                 return True
627 
/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in _dispatch(self, batch)
586         dispatch_timestamp = time.time()
587         cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self)
--> 588         job = self._backend.apply_async(batch, callback=cb)
589         self._jobs.append(job)
590 
/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.pyc in apply_async(self, func, callback)
109     def apply_async(self, func, callback=None):
110         """Schedule a func to be run"""
--> 111         result = ImmediateResult(func)
112         if callback:
113             callback(result)
/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.pyc in __init__(self, batch)
330         # Don't delay the application, to avoid keeping the input
331         # arguments in memory
--> 332         self.results = batch()
333 
334     def get(self):
/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self)
129 
130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
132 
133     def __len__(self):
/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/model_selection/_validation.pyc in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, error_score)
456             estimator.fit(X_train, **fit_params)
457         else:
--> 458             estimator.fit(X_train, y_train, **fit_params)
459 
460     except Exception as e:
/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/pipeline.pyc in fit(self, X, y, **fit_params)
248         Xt, fit_params = self._fit(X, y, **fit_params)
249         if self._final_estimator is not None:
--> 250             self._final_estimator.fit(Xt, y, **fit_params)
251         return self
252 
/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/pipeline.pyc in fit(self, X, y, **fit_params)
246             This estimator
247         """
--> 248         Xt, fit_params = self._fit(X, y, **fit_params)
249         if self._final_estimator is not None:
250             self._final_estimator.fit(Xt, y, **fit_params)
/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/pipeline.pyc in _fit(self, X, y, **fit_params)
211                 Xt, fitted_transformer = fit_transform_one_cached(
212                     cloned_transformer, None, Xt, y,
--> 213                     **fit_params_steps[name])
214                 # Replace the transformer of the step with the fitted
215                 # transformer. This is necessary when loading the transformer
/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/externals/joblib/memory.pyc in __call__(self, *args, **kwargs)
360 
361     def __call__(self, *args, **kwargs):
--> 362         return self.func(*args, **kwargs)
363 
364     def call_and_shelve(self, *args, **kwargs):
/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/pipeline.pyc in _fit_transform_one(transformer, weight, X, y, **fit_params)
579                        **fit_params):
580     if hasattr(transformer, 'fit_transform'):
--> 581         res = transformer.fit_transform(X, y, **fit_params)
582     else:
583         res = transformer.fit(X, y, **fit_params).transform(X)
<ipython-input-108-dfcab4b62582> in fit_transform(self, X, y)
138         '''
139 
--> 140         self.fit(X, y)
141         return self.transform(X)
<ipython-input-108-dfcab4b62582> in fit(self, X, y)
73             #Majority-negative features need to check df.quantile(1-q)
74                 #in order to be using correct quantile value
---> 75             pos = np.percentile(X[:, self.feature_list], self.q * 100, axis = 0)
76             neg = np.percentile(X[:, self.feature_list], (1.0 - self.q) * 100, axis = 0)
77 
IndexError: index 10 is out of bounds for axis 1 with size 10

When you pass a pipeline to GridSearchCV, best_estimator_ also contains a Pipeline object (regardless of whether you tuned only a single step of that pipeline or all of them).

So when you do this:

knn_pipe_tuned = Pipeline([('impute', imp), ('engineer', topQ), ('scale', scaler),
                           ('select', selector), ('kNN', knn_gs.best_estimator_)])

you are essentially doing this:

knn_pipe_tuned = Pipeline([('impute', imp), ('engineer', topQ), ('scale', scaler),
                           ('select', selector), ('kNN', Pipeline([('impute', imp),
                                                                   ('engineer', topQ),
                                                                   ('scale', scaler),
                                                                   ('select', selector),
                                                                   ('kNN', knn)]))])

So this will impute, engineer, scale, and select all over again, on data that has already been through every one of those steps. I'm sure that's not what you want. It is also exactly where the IndexError comes from: by the time the inner pipeline's TopQuantile runs, the outer SelectPercentile has already cut the data down to 10 columns, so the integer positions stored in feature_list (at least one of which is 10 or higher) fall off the end of axis 1. The GridSearchCV run worked because there you fit the flat knn_pipe, with no nesting.

When running cross_validate, all you actually need is:

knn_pipe_tuned = knn_gs.best_estimator_
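If you want to convince yourself of this, inspect best_estimator_ after fitting; a quick check (the step names match the pipeline defined above):

#best_estimator_ is the entire tuned Pipeline, not just the kNN step
print(type(knn_gs.best_estimator_))                    #<class 'sklearn.pipeline.Pipeline'>
print([name for name, step in knn_gs.best_estimator_.steps])
#['impute', 'engineer', 'scale', 'select', 'kNN']

#And if you ever do want just the tuned classifier by itself:
knn_best = knn_gs.best_estimator_.named_steps['kNN']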
