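Sklearn transform Pipeline and FeatureUnion

I run into a problem when I try to run the following code. It is a machine learning problem about housing prices.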

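import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler, LabelBinarizer
from sklearn.preprocessing import Imputer  # only available in older scikit-learn releases
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator,TransformerMixin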
num_attributes=list(housing_num)
cat_attributes=['ocean_proximity']
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6
class DataFrameSelector(BaseEstimator,TransformerMixin):
    def __init__(self,attribute_names):
        self.attribute_names=attribute_names
    def fit(self,X,y=None):
        return self
    def transform(self,X,y=None):
        return X[self.attribute_names].values
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room 
    def fit(self, X,y=None):
        return self # nothing else to do 
    def transform(self, X,y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix] 
        population_per_household = X[:, population_ix] / X[:, household_ix] 
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix] 
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

num_pipeline=Pipeline([
    ('selector',DataFrameSelector(num_attributes)),
    ('imputer',Imputer(strategy="median")),
    ('attribs_adder',CombinedAttributesAdder()),
    ('std_scalar',StandardScaler()),
    ])
cat_pipeline=Pipeline([
    ('selector',DataFrameSelector(cat_attributes)),
    ('label_binarizer',LabelBinarizer()),
    ])
full_pipeline=FeatureUnion(transformer_list=[
    ("num_pipeline",num_pipeline),
    ("cat_pipeline",cat_pipeline),
    ])

I get an error when I try to run:

housing_prepared = full_pipeline.fit_transform(housing)

The error is:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-141-acd0fd68117b> in <module>()
----> 1 housing_prepared = full_pipeline.fit_transform(housing)
/Users/nieguangtao/ml/env_1/lib/python2.7/site-packages/sklearn/pipeline.pyc in fit_transform(self, X, y, **fit_params)
    744             delayed(_fit_transform_one)(trans, weight, X, y,
    745                                         **fit_params)
--> 746             for name, trans, weight in self._iter())
    747 
    748         if not result:
/Users/nieguangtao/ml/env_1/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
    777             # was dispatched. In particular this covers the edge
    778             # case of Parallel used with an exhausted iterator.
--> 779             while self.dispatch_one_batch(iterator):
    780                 self._iterating = True
    781             else:
/Users/nieguangtao/ml/env_1/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in dispatch_one_batch(self, iterator)
    623                 return False
    624             else:
--> 625                 self._dispatch(tasks)
    626                 return True
    627 
/Users/nieguangtao/ml/env_1/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in _dispatch(self, batch)
    586         dispatch_timestamp = time.time()
    587         cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self)
--> 588         job = self._backend.apply_async(batch, callback=cb)
    589         self._jobs.append(job)
    590 
/Users/nieguangtao/ml/env_1/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.pyc in apply_async(self, func, callback)
    109     def apply_async(self, func, callback=None):
    110         """Schedule a func to be run"""
--> 111         result = ImmediateResult(func)
    112         if callback:
    113             callback(result)
/Users/nieguangtao/ml/env_1/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.pyc in __init__(self, batch)
    330         # Don't delay the application, to avoid keeping the input
    331         # arguments in memory
--> 332         self.results = batch()
    333 
    334     def get(self):
/Users/nieguangtao/ml/env_1/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
    132 
    133     def __len__(self):
/Users/nieguangtao/ml/env_1/lib/python2.7/site-packages/sklearn/pipeline.pyc in _fit_transform_one(transformer, weight, X, y, **fit_params)
    587                        **fit_params):
    588     if hasattr(transformer, 'fit_transform'):
--> 589         res = transformer.fit_transform(X, y, **fit_params)
    590     else:
    591         res = transformer.fit(X, y, **fit_params).transform(X)
/Users/nieguangtao/ml/env_1/lib/python2.7/site-packages/sklearn/pipeline.pyc in fit_transform(self, X, y, **fit_params)
    290         Xt, fit_params = self._fit(X, y, **fit_params)
    291         if hasattr(last_step, 'fit_transform'):
--> 292             return last_step.fit_transform(Xt, y, **fit_params)
    293         elif last_step is None:
    294             return Xt
TypeError: fit_transform() takes exactly 2 arguments (3 given)

So my first question is: what causes this error?

After getting this error, I tried to find out why, so I ran the transformers above one by one:

DFS=DataFrameSelector(num_attributes)
a1=DFS.fit_transform(housing)
imputer=Imputer(strategy='median')
a2=imputer.fit_transform(a1)
CAA=CombinedAttributesAdder()
a3=CAA.fit_transform(a2)
SS=StandardScaler()
a4=SS.fit_transform(a3)
DFS2=DataFrameSelector(cat_attributes)
b1=DFS2.fit_transform(housing)
LB=LabelBinarizer()
b2=LB.fit_transform(b1)
result=np.concatenate((a4,b2),axis=1)

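These all execute correctly, except that the result I get is a numpy.ndarray of shape (16512, 16), while the expected result of housing_prepared = full_pipeline.fit_transform(housing) should be a numpy.ndarray of shape (16512, 17). So this is my second question: what causes the difference?

housing is a DataFrame of shape (16512, 9), with only 1 categorical feature and 8 numerical features.

Thanks in advance.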

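It looks like sklearn is identifying the data types differently. Make sure the numbers are recognized as int. The simplest way: use the data provided by the author of the code you posted (Aurelien Geron, Hands-On Machine Learning).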

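I ran into this problem while working through the book. After trying a bunch of workarounds (which felt like a waste of time), I gave in and installed scikit-learn v0.20 dev. Download the wheel here and install it with pip. This should let you use the CategoricalEncoder class, which was designed to deal with these issues.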

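I had the same problem, and it was caused by an indentation issue that does not always raise an error (see https://stackoverflow.com/a/14046894/3665886).

If you copy the code directly from the book, make sure it is indented correctly.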
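As an illustration of how an indentation slip can pass silently, here is a minimal sketch with two hypothetical classes (WellIndented and BadlyIndented are made up for this example, and this is not necessarily the exact case described in the linked answer). When a method is indented one level too deep, Python still accepts the file, but the method ends up nested inside the previous one and the class never gets it.

```python
class WellIndented:
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X


class BadlyIndented:
    def fit(self, X, y=None):
        return self

        # Indented one level too deep: this def sits inside fit(), after
        # the return statement, so it never becomes a method of the class.
        def transform(self, X, y=None):
            return X


has_good = hasattr(WellIndented, "transform")
has_bad = hasattr(BadlyIndented, "transform")
print(has_good, has_bad)  # prints: True False
```

No SyntaxError or IndentationError is raised here; the problem only shows up later, when something tries to call the missing method.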

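TypeError: fit_transform() takes exactly 2 arguments (3 given)

Why this error?

Answer: because you are using LabelBinarizer(), which is meant for the response variable. Its fit_transform() accepts only y, whereas the Pipeline calls fit_transform(Xt, y) on its last step, as the last frame of the traceback shows.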
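The mismatch can be reproduced without scikit-learn. The sketch below uses a hypothetical stand-in class (LabelOnlyBinarizer) whose fit_transform takes labels only, and a hypothetical helper (call_like_pipeline) that calls it with both X and y, the way the traceback shows the Pipeline doing. It is an illustration of the mechanism, not scikit-learn's actual code.

```python
class LabelOnlyBinarizer:
    """Hypothetical stand-in: fit_transform accepts labels only."""

    def fit_transform(self, y):
        return sorted(set(y))


def call_like_pipeline(step, X, y=None):
    # Mirrors the call in the traceback: last_step.fit_transform(Xt, y)
    return step.fit_transform(X, y)


try:
    call_like_pipeline(LabelOnlyBinarizer(), ["INLAND", "ISLAND"])
    message = None
except TypeError as exc:
    # self + X + y makes three arguments, but only two are accepted
    message = str(exc)

print(message)
```

The extra argument counted in the message is self, which is why a two-argument call is reported as three.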

What to do? You have some options:

Use OneHotEncoder() instead
Write a custom transformer around LabelBinarizer
Use an older version of sklearn that supports your code
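The first two options could look roughly like the sketch below. PipelineLabelBinarizer is a hypothetical name for a thin wrapper that gives LabelBinarizer the fit(X, y=None) / transform(X) signature a Pipeline expects, and the four-row cats array is made-up sample data. It assumes a scikit-learn version (0.20 or later) in which OneHotEncoder accepts string categories.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder


class PipelineLabelBinarizer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper exposing LabelBinarizer as an (X, y) transformer."""

    def fit(self, X, y=None):
        # Flatten the single categorical column into a 1-D label vector
        self.encoder_ = LabelBinarizer().fit(np.ravel(X))
        return self

    def transform(self, X):
        return self.encoder_.transform(np.ravel(X))


# Made-up sample: one categorical column, three distinct categories
cats = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"], ["ISLAND"]])

# Option 2: the wrapper works as the last step of a Pipeline
wrapped = Pipeline([("label_binarizer", PipelineLabelBinarizer())])
wrapped_out = wrapped.fit_transform(cats)

# Option 1: OneHotEncoder returns a sparse matrix by default
onehot_out = OneHotEncoder().fit_transform(cats).toarray()

print(wrapped_out.shape, onehot_out.shape)
```

With three or more categories both approaches produce one indicator column per category, in sorted category order, so either can replace the 'label_binarizer' step in cat_pipeline.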

Difference in the shape of housing_prepared

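If you are using this data, you have 9 predictors (8 numerical and 1 categorical). CombinedAttributesAdder() adds 3 columns and LabelBinarizer() adds 5, so it becomes 17 columns.
Remember that sklearn.pipeline.FeatureUnion concatenates the results of multiple transformer objects.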
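A small sketch of that concatenation, using a made-up 4x3 array and two arbitrary transformers (the step names "scaled" and "first_two" are chosen only for this example): the width of the FeatureUnion output is the sum of the widths each transformer returns.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Made-up input: 4 rows, 3 numeric columns
X = np.arange(12, dtype=float).reshape(4, 3)

union = FeatureUnion([
    ("scaled", StandardScaler()),                            # returns 3 columns
    ("first_two", FunctionTransformer(lambda a: a[:, :2])),  # returns 2 columns
])

out = union.fit_transform(X)
print(out.shape)  # prints: (4, 5)
```

Each transformer receives the same input X, and the outputs are stacked side by side in the order the transformers are listed.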

When you do it manually, you are not adding the original "ocean_proximity" variable.

Let's see it in action:

print("housing_shape: ", housing.shape)
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]
DFS=DataFrameSelector(num_attribs)
a1=DFS.fit_transform(housing)
print('Numerical variables_shape: ', a1.shape)
imputer=SimpleImputer(strategy='median')
a2=imputer.fit_transform(a1)
a2.shape

Same as a1.shape.

CAA=CombinedAttributesAdder()
a3=CAA.fit_transform(a2)
SS=StandardScaler()
a4=SS.fit_transform(a3) # added 3 variables
print('Numerical variable shape after CAA: ', a4.shape, '\n')
DFS2=DataFrameSelector(cat_attribs)
b1=DFS2.fit_transform(housing)
print("Categorical variables_shape: ", b1.shape)
LB=LabelBinarizer()
b2=LB.fit_transform(b1) # instead of one column now we have 5 columns
print('categorical variable shape after LabelBinarization: ', b2.shape)

4 columns added.

print(b2)
result=np.concatenate((a4,b2),axis=1)
print('final shape: ', result.shape, '\n') # Final shape

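Note: the transformed columns (the result in a4) and the binarized columns (the result in b2) have not been added to the original DataFrame yet. To do that, you need to convert the NumPy array b2 to a DataFrame.

import pandas as pd
new_features = pd.DataFrame(a4)
new_features.shape
# LabelBinarizer orders its output columns by LB.classes_ (sorted labels),
# so take the column names from there rather than hard-coding an order
ocean_cat = list(LB.classes_)
ocean_LabelBinarize = pd.DataFrame(b2, columns=ocean_cat)
ocean_LabelBinarize
housing_prepared_new = pd.concat([new_features, ocean_LabelBinarize], axis=1)
print('Shape of new data prepared by above steps', housing_prepared_new.shape)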

When we use the pipeline, it also keeps the original (ocean_proximity) variable alongside the newly created binary columns.
