Confusion about Sklearn Pipeline and Feature Union



I have put together a toy example to illustrate my confusion. I realize it is a silly example built on the iris dataset.

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    """For data grouped by feature, select subset of data at a provided key.

    The data is expected to be stored in a 2D data structure, where the first
    index is over features and the second is over samples.  i.e.

    >> len(data[key]) == n_samples

    Please note that this is the opposite convention to scikit-learn feature
    matrixes (where the first index corresponds to sample).

    ItemSelector only requires that the collection implement getitem
    (data[key]).  Examples include: a dict of lists, 2D numpy array, Pandas
    DataFrame, numpy record array, etc.

    >> data = {'a': [1, 5, 2, 5, 2, 8],
               'b': [9, 4, 1, 4, 1, 3]}
    >> ds = ItemSelector(key='a')
    >> data['a'] == ds.transform(data)

    ItemSelector is not designed to handle data grouped by sample.  (e.g. a
    list of dicts).  If your data is structured this way, consider a
    transformer along the lines of `sklearn.feature_extraction.DictVectorizer`.

    Parameters
    ----------
    key : hashable, required
        The key corresponding to the desired value in a mappable.
    """
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]

# 150 examples, 4 features, labels in {0, 1, 2}
iris = load_iris()
y = iris.target
dfX = pd.DataFrame(iris.data, columns=iris.feature_names)

# feature union transformer list
transformer_list = [
    ('sepal length (cm)', Pipeline([
        ('selector', ItemSelector(key='sepal length (cm)')),
    ])),
    ('sepal width (cm)', Pipeline([
        ('selector', ItemSelector(key='sepal width (cm)')),
    ])),
    ('petal length (cm)', Pipeline([
        ('selector', ItemSelector(key='petal length (cm)')),
    ])),
    ('petal width (cm)', Pipeline([
        ('selector', ItemSelector(key='petal width (cm)')),
    ])),
]

# create pipeline
pipeline = Pipeline([
    ("union", FeatureUnion(transformer_list=transformer_list)),
    ("svm", SVC(kernel="linear")),
])

# train model
param_grid = dict({})
search = GridSearchCV(estimator=pipeline, param_grid=param_grid, n_jobs=1)
search.fit(dfX, y)
print(search.best_estimator_)

It fails with:

/Users/me/.virtualenvs/myenv/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_consistent_length(*arrays)
179     if len(uniques) > 1:
180         raise ValueError("Found input variables with inconsistent numbers of"
--> 181                          " samples: %r" % [int(l) for l in lengths])
182
183
ValueError: Found input variables with inconsistent numbers of samples: [1, 99]

My understanding is that FeatureUnions run in parallel and Pipelines run in series.

Where is my thinking wrong? And what is the correct way to mix the two, so that each feature type gets its own enrichment stream to which I can add transformers piece by piece, while all the streams are still combined with a FeatureUnion feeding the final predictor?

Yes, your understanding is correct:

My understanding is that FeatureUnions run in parallel and Pipelines run in series.

But the problem here is that ItemSelector returns a numpy array of shape (150,). In the final step of FeatureUnion, the features coming from the different transformers are concatenated with numpy.hstack(), which stacks arrays horizontally. Notice that the second dimension is missing from that output, so taking the union of all the transformers produces an array of shape (600,). That is why the error is raised inside the fit method.
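To see this concretely, here is a minimal sketch (reusing the iris DataFrame from the question) of what happens when four 1-d column selections are hstacked:

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
dfX = pd.DataFrame(iris.data, columns=iris.feature_names)

# What each ItemSelector hands back: a single column with no second dimension.
col = dfX['sepal length (cm)']
print(col.shape)         # (150,)

# FeatureUnion concatenates the branch outputs with numpy.hstack; for 1-d
# inputs this stacks them end to end instead of side by side.
stacked = np.hstack([dfX[name] for name in iris.feature_names])
print(stacked.shape)     # (600,) -- samples and features flattened together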

The correct shape of the returned array should be (150, 4). You need to add one more transformer to each inner pipeline that returns the data in the right shape (or change the shape of the returned data manually, e.g. with numpy.asmatrix() and transpose()).
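As a sketch of that idea (the Reshaper class and the list comprehension are my own names, reusing the ItemSelector class and dfX from the question), each branch can end with a tiny transformer that turns its 1-d output into an (n_samples, 1) column:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class Reshaper(BaseEstimator, TransformerMixin):
    """Turn a 1-d array/Series into a 2-d column so FeatureUnion can stack it."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.asarray(X).reshape(-1, 1)

transformer_list = [
    (name, Pipeline([
        ('selector', ItemSelector(key=name)),   # (150,) per column
        ('reshaper', Reshaper()),               # -> (150, 1)
    ]))
    for name in dfX.columns
]

With every branch emitting a (150, 1) block, the FeatureUnion output becomes (150, 4) and the rest of the original script fits without the length error.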

Please have a look at the FeatureUnion example in the documentation. There the output of ItemSelector is passed on to other transformers such as TfidfVectorizer() and DictVectorizer(), which return the data as 2-d arrays, so the FeatureUnion can combine them correctly. Hope this helps.
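If the input is a pandas DataFrame, another option along the same lines (not from the linked example, just a sketch) is to make the selector itself return a 2-d slice by indexing with a list of keys:

from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector2D(BaseEstimator, TransformerMixin):
    """Like ItemSelector, but keeps the second dimension of the selection."""
    def __init__(self, key):
        self.key = key
    def fit(self, X, y=None):
        return self
    def transform(self, data_dict):
        # Double brackets return a DataFrame of shape (n_samples, 1)
        # instead of a flat Series.
        return data_dict[[self.key]]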
