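I want to build a sklearn pipeline (part of a larger pipeline) that:
- encodes categorical columns (OneHotEncoder)
- reduces dimensionality (SVD)
- adds the numeric columns (untransformed)
- aggregates rows (pandas groupby)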
I used this pipeline example:
and this custom TransformerMixin example:
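I get an error at step 4, the aggregation (there is no error if I comment out step 4):
AttributeError Traceback (most recent call last)
in ()
----> 1 X_train_transformed = pipe.fit_transform(X_train)
....
AttributeError: 'numpy.ndarray' object has no attribute 'fit'
My code: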
import pandas as pd  # needed by Aggregator below
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import TruncatedSVD
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
# does nothing, but is here to collect numerical columns
class nothing(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X
class Aggregator(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # the preprocessor outputs an array, so rebuild a DataFrame
        X = pd.DataFrame(X)
        X = X.rename(columns={0: 'InvoiceNo', 1: 'amount', 2: 'Quantity',
                              3: 'UnitPrice', 4: 'CustomerID'})
        X['InvoiceNo'] = X['InvoiceNo'].astype('int')
        X['Quantity'] = X['Quantity'].astype('float64')
        X['UnitPrice'] = X['UnitPrice'].astype('float64')
        aggregations = dict()
        # columns 5 onward hold the SVD components; the original upper bound
        # X.shape[1]-1 skipped the last one
        for col in range(5, X.shape[1]):
            aggregations[col] = 'max'
        aggregations.update({'CustomerID': 'first', 'amount': 'sum',
                             'Quantity': 'mean', 'UnitPrice': 'mean'})
        # aggregating all basket lines
        result = X.groupby('InvoiceNo').agg(aggregations)
        # add number of lines in the basket
        result['lines_nb'] = X.groupby('InvoiceNo').size()
        return result
numeric_features = ['InvoiceNo', 'amount', 'Quantity', 'UnitPrice',
                    'CustomerID']
numeric_transformer = Pipeline(steps=[('nothing', nothing())])
categorical_features = ['StockCode', 'Country']
preprocessor = ColumnTransformer(
    [
        # 'num' transformer does nothing, but is here to
        # collect numerical columns
        ('num', numeric_transformer, numeric_features),
        ('cat', Pipeline([
            ('onehot', OneHotEncoder(handle_unknown='ignore')),
            ('best', TruncatedSVD(n_components=100)),
        ]), categorical_features)
    ]
)
# edited per Artem's solution
# aggregator = ('agg', Aggregator())
pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    # edited per Artem's solution
    # ('aggregator', aggregator),
    ('aggregator', Aggregator())
])
X_train_transformed = pipe.fit_transform(X_train)
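A side note on the `nothing` transformer: ColumnTransformer also accepts the string 'passthrough' in place of an estimator, which forwards the selected columns unchanged. Below is a minimal sketch of that idea, not the code from this question. The toy frame, the `toy_preprocessor` name, and `sparse_threshold=0` (used here only to force a dense array) are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, not the real invoice data set
toy = pd.DataFrame({
    "amount": [10.0, 5.0, 7.0],
    "Quantity": [1.0, 2.0, 3.0],
    "Country": ["UK", "UK", "FR"],
})

toy_preprocessor = ColumnTransformer(
    [
        # 'passthrough' forwards the numeric columns without a custom class
        ("num", "passthrough", ["amount", "Quantity"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Country"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

out = toy_preprocessor.fit_transform(toy)
print(out.shape)  # 2 numeric columns + 2 one-hot columns
```

The numeric columns come first in the output because the transformers are concatenated in the order they are listed.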
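Pipeline steps have the form ('name', Class()), that is, a name paired with an estimator instance, but the original code essentially had:
aggregator = ('agg', Aggregator())
pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('aggregator', aggregator),
])
That makes the step ('aggregator', ('agg', Aggregator())): a tuple nested inside a tuple, so the pipeline is handed a tuple where it expects an estimator with a fit method.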
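The difference can be reproduced on a small scale. The sketch below is illustrative only: `ToyAggregator` and the three-row frame are hypothetical stand-ins for the question's Aggregator and data, and the exact exception type and message raised for the nested step depend on the scikit-learn version, so the code only records that some error occurs.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class ToyAggregator(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for Aggregator: sums 'amount' per invoice."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.groupby("InvoiceNo", as_index=False)["amount"].sum()


X = pd.DataFrame({"InvoiceNo": [1, 1, 2], "amount": [10.0, 5.0, 7.0]})

# Wrong: the second element of the step is itself a (name, estimator) tuple
nested_error = None
try:
    bad_pipe = Pipeline(steps=[("aggregator", ("agg", ToyAggregator()))])
    bad_pipe.fit(X)
except Exception as exc:  # type and message vary across sklearn versions
    nested_error = exc

# Right: each step is a flat (name, estimator_instance) pair
good_pipe = Pipeline(steps=[("aggregator", ToyAggregator())])
result = good_pipe.fit_transform(X)

print(type(nested_error).__name__)
print(result)
```

Depending on the version, the nested step is rejected either when the Pipeline is constructed or when it is fitted, which is why both calls sit inside the `try` block.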