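I'm trying to get TfidfVectorizer to work inside a pipeline, but the pipeline raises an error about mismatched concatenation axes. TfidfVectorizer seems to work fine when called outside the pipeline, which is very simple at the moment. Here is the code that produces the error.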
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# data is a pandas DataFrame holding text, integer and 'target' columns
text_features = data.select_dtypes(include=['object']).columns
numeric_features = data.drop(['target'], axis=1).select_dtypes(include=['int64', 'int32']).columns
numeric_transformer = StandardScaler()
text_transformer = TfidfVectorizer(max_df=5)
preprocessor = ColumnTransformer(
    transformers=[
        ('text', text_transformer, text_features),
        ('num', numeric_transformer, numeric_features)  # errors are the same even if I comment this out
    ])
X_train, X_test, y_train, y_test = train_test_split(data.drop(['target'], axis=1),
                                                    data['target'],
                                                    random_state=0)
pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('SVC', SVC(C=10000))])
text_transformer.fit(X_train[text_features])  # does not produce error
preprocessor.fit(X_train)  # produces error (see below)
This is the error message, a ValueError:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-248-bd1982955a12> in <module>
----> 1 func()
<ipython-input-247-d2024ad1ac0c> in func()
24 random_state=0)
25 text_transformer.fit(data['text'])
---> 26 preprocessor.fit(X_train)
27
28 print(X_train.shape, y_train.shape)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py in fit(self, X, y)
492 # we use fit_transform to make sure to set sparse_output_ (for which we
493 # need the transformed data) to have consistent output type in predict
--> 494 self.fit_transform(X, y=y)
495 return self
496
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py in fit_transform(self, X, y)
551 self._validate_output(Xs)
552
--> 553 return self._hstack(list(Xs))
554
555 def transform(self, X):
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py in _hstack(self, Xs)
637 else:
638 Xs = [f.toarray() if sparse.issparse(f) else f for f in Xs]
--> 639 return np.hstack(Xs)
640
641 def _sk_visual_block_(self):
<__array_function__ internals> in hstack(*args, **kwargs)
C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\shape_base.py in hstack(tup)
343 return _nx.concatenate(arrs, 0)
344 else:
--> 345 return _nx.concatenate(arrs, 1)
346
347
<__array_function__ internals> in concatenate(*args, **kwargs)
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 1 and the array at index 1 has size 4179
The difficulty is created by the first line,
text_features = data.select_dtypes(include=['object']).columns
which returns a pandas Index object.
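Checking the documentation for ColumnTransformer, we find this for the columns parameter: "A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer." An Index is list-like, so TfidfVectorizer is handed a 2d DataFrame. Iterating over a DataFrame yields its column labels rather than its rows, so the vectorizer sees one "document" per column and returns a single row, which is the "size 1" in the error message.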
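To make the 1d versus 2d difference concrete, here is a minimal sketch on a made-up four-row frame. The data and the column name text are assumptions chosen for illustration, not the asker's dataset.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data: four short documents in one text column.
df = pd.DataFrame({"text": ["red apple", "green apple", "red car", "blue car"]})

# 2d input (a one-column DataFrame): iterating it yields the column label,
# so the vectorizer treats the string "text" as the only document.
two_d = TfidfVectorizer().fit_transform(df[["text"]])

# 1d input (a Series): iterating it yields the four cell values.
one_d = TfidfVectorizer().fit_transform(df["text"])

print(two_d.shape)  # one row, one term
print(one_d.shape)  # four rows, one column per distinct term
```

The first result has a single row no matter how many rows the frame has, which is why it cannot be stacked next to the scaled numeric columns.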
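To select multiple columns by name or dtype there is make_column_selector. Note that it takes a regex pattern or dtype_include/dtype_exclude rather than an Index object, and that it returns a list of column names, so the transformer still receives a 2d array. That suits StandardScaler, but TfidfVectorizer needs the column name as a scalar string ('text' here, going by the traceback):
from sklearn.compose import make_column_selector
...
('text', text_transformer, 'text'),  # scalar name: the vectorizer receives a 1d column
('num', numeric_transformer, make_column_selector(dtype_include=['int64', 'int32']))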
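Alternatively, if the columns have stable names, you can pass the column names in multiple steps in the pipeline. This is less flexible, but possibly more readable.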
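One way to read "multiple steps" is one vectorizer per text column, each given its column name as a scalar string. The sketch below assumes a made-up frame with two text columns (text, title) and one integer column (count); none of these names or values come from the original dataset, and SVC is left out to keep the example short.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the asker's DataFrame.
data = pd.DataFrame({
    "text": ["red apple", "green apple", "red car", "blue car"],
    "title": ["fruit", "fruit", "vehicle", "vehicle"],
    "count": [1, 2, 3, 4],
    "target": [0, 0, 1, 1],
})
X = data.drop(columns=["target"])

# Stable, explicitly listed text column names (an assumption of this sketch).
text_columns = ["text", "title"]

# One TfidfVectorizer per text column; the scalar name makes
# ColumnTransformer pass a 1d column to each vectorizer.
transformers = [(f"tfidf_{col}", TfidfVectorizer(), col) for col in text_columns]

# Numeric columns can be selected together, since StandardScaler expects 2d input.
transformers.append(
    ("num", StandardScaler(), make_column_selector(dtype_include="number"))
)

preprocessor = ColumnTransformer(transformers=transformers)
features = preprocessor.fit_transform(X)
print(features.shape)  # one row per sample
```

Each vectorizer now returns one row per sample, so the blocks can be stacked side by side. The cost is a separate vocabulary per text column.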