How do I get TfidfVectorizer to work in a pipeline? Concatenation axis mismatch



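I am trying to get TfidfVectorizer to work inside a pipeline, but the pipeline raises a concatenation-axis mismatch error. TfidfVectorizer seems to work fine when called outside the pipeline, which is very simple at the moment. The code below reproduces the error.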

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# data is a pandas DataFrame holding the text, the integer features and 'target'
text_features = data.select_dtypes(include=['object']).columns
numeric_features = data.drop(['target'], axis=1).select_dtypes(include=['int64', 'int32']).columns
numeric_transformer = StandardScaler()
text_transformer = TfidfVectorizer(max_df=5)
preprocessor = ColumnTransformer(
    transformers=[
        ('text', text_transformer, text_features),
        ('num', numeric_transformer, numeric_features),  # errors are the same even if I comment this out
    ])
pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('SVC', SVC(C=10000))])
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['target'], axis=1),
    data['target'],
    random_state=0)
text_transformer.fit(X_train[text_features])  # does not produce an error
preprocessor.fit(X_train)  # produces the error shown below

Here is the error message, a ValueError:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-248-bd1982955a12> in <module>
----> 1 func()
<ipython-input-247-d2024ad1ac0c> in func()
24                                                     random_state=0)
25     text_transformer.fit(data['text'])
---> 26     preprocessor.fit(X_train)
27 
28     print(X_train.shape, y_train.shape)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py in fit(self, X, y)
492         # we use fit_transform to make sure to set sparse_output_ (for which we
493         # need the transformed data) to have consistent output type in predict
--> 494         self.fit_transform(X, y=y)
495         return self
496 
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py in fit_transform(self, X, y)
551         self._validate_output(Xs)
552 
--> 553         return self._hstack(list(Xs))
554 
555     def transform(self, X):
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py in _hstack(self, Xs)
637         else:
638             Xs = [f.toarray() if sparse.issparse(f) else f for f in Xs]
--> 639             return np.hstack(Xs)
640 
641     def _sk_visual_block_(self):
<__array_function__ internals> in hstack(*args, **kwargs)
C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\shape_base.py in hstack(tup)
343         return _nx.concatenate(arrs, 0)
344     else:
--> 345         return _nx.concatenate(arrs, 1)
346 
347 
<__array_function__ internals> in concatenate(*args, **kwargs)
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 1 and the array at index 1 has size 4179

The difficulty is caused by the first line,

text_features = data.select_dtypes(include=['object']).columns

which returns a pandas Index object.

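Checking the ColumnTransformer documentation, we find this for the columns argument: "A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer." TfidfVectorizer expects a 1D sequence of documents, but an Index of column names selects a 2D frame. Iterating over that frame yields its column labels rather than its rows, so the vectorizer produces one row per text column (here 1) instead of one row per sample (here 4179), and the horizontal stack with the numeric block fails.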
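The difference between a 1D and a 2D selection can be seen without ColumnTransformer at all. The following is a minimal sketch on a made-up three-row frame; the column name `text` and the sample sentences are assumptions for illustration only.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data: one text column, three documents.
docs = pd.DataFrame({"text": ["the cat sat", "the dog ran", "a bird flew"]})

# 2D selection (list or Index of columns): iterating a DataFrame yields its
# column labels, so the vectorizer sees a single "document", the string "text".
shape_2d = TfidfVectorizer().fit_transform(docs[["text"]]).shape

# 1D selection (scalar column name): one document per row, as intended.
shape_1d = TfidfVectorizer().fit_transform(docs["text"]).shape

print("2D input:", shape_2d)
print("1D input:", shape_1d)
```

The 2D call returns a single row however many samples the frame holds, which matches the "size 1" reported for the array at index 0 in the traceback.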

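make_column_selector is the tool for selecting several columns by name pattern or dtype, but it does not help here: it takes a regex pattern or dtype arguments rather than an Index object, and it returns a list of columns, so the transformer would still receive a 2D frame. For TfidfVectorizer, pass the name of the single text column as a scalar string instead:

...
('text', text_transformer, 'text')  # 'text' is the text column's name, as in data['text'] in the traceback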
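For reference, here is a small self-contained sketch of a ColumnTransformer in which the text transformer is given a scalar column name and therefore receives a 1D column. The frame and its column names (`text`, `length`) are made up for illustration; they are not the asker's data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the column names are assumptions.
data = pd.DataFrame({
    "text": ["spam spam eggs", "eggs and ham", "ham only", "green eggs"],
    "length": [3, 3, 2, 2],
})

preprocessor = ColumnTransformer(transformers=[
    # Scalar string: TfidfVectorizer receives a 1D Series of documents.
    ("text", TfidfVectorizer(), "text"),
    # List of names: StandardScaler receives the 2D frame it expects.
    ("num", StandardScaler(), ["length"]),
])

features = preprocessor.fit_transform(data)
print(features.shape)  # one row per sample, TF-IDF columns plus one numeric column
```

Both blocks now have one row per sample, so the horizontal stack succeeds.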

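Alternatively, if the columns have stable names, each column name can be passed as a scalar string in its own ColumnTransformer step. This is less flexible, but arguably more readable.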
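One way to write the one-step-per-column variant is sketched below. The frame, the column names `title` and `body`, and the `tfidf_` step-name prefix are all assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical frame with two text columns; names are assumptions.
data = pd.DataFrame({
    "title": ["red apple", "green pear", "ripe plum"],
    "body": ["sweet and crisp", "soft and juicy", "dark and tart"],
})

text_columns = ["title", "body"]

# One (name, transformer, scalar column) entry per text column, so each
# vectorizer is fitted on a 1D column and builds its own vocabulary.
preprocessor = ColumnTransformer(transformers=[
    (f"tfidf_{name}", TfidfVectorizer(), name) for name in text_columns
])

features = preprocessor.fit_transform(data)
print(features.shape)
```

Each text column gets an independent vocabulary, and the resulting blocks are stacked side by side with one row per sample.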
