sklearn: Can't get OneHotEncoder to work with Pipeline



I am building a pipeline for a model using ColumnTransformer. This is what my pipeline looks like:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder, MinMaxScaler
from sklearn.impute import KNNImputer

# Step 1: impute the numeric columns; all other columns are passed through.
imputer_transformer = ColumnTransformer([
    ('knn_imputer', KNNImputer(n_neighbors=5), [0, 3, 4, 6, 7])
], remainder='passthrough')

# Step 2: scale and encode, using column positions from the original dataset.
category_transformer = ColumnTransformer([
    ("kms_driven_engine_min_max_scaler", MinMaxScaler(), [0, 6]),
    ("owner_ordinal_enc", OrdinalEncoder(categories=[['fourth', 'third', 'second', 'first']], handle_unknown='ignore', dtype=np.int16), [3]),
    ("brand_location_ohe", OneHotEncoder(sparse=False, handle_unknown='ignore'), [2, 5]),
], remainder='passthrough')

def build_pipeline_with_estimator(estimator):
    return Pipeline([
        ('imputer', imputer_transformer),
        ('category_transformer', category_transformer),
        ('estimator', estimator),
    ])

This is what my dataset looks like:

kms_driven  owner  location  mileage  power  brand          engine  age
34000.0     first  other     NaN      12.0   Yamaha         150.0   9
28000.0     first  other     72.0     7.0    Hero           100.0   16
5947.0      first  other     53.0     19.0   Bajaj          NaN     4
11000.0     first  delhi     40.0     19.8   Royal Enfield  350.0   7
13568.0     first  delhi     63.0     14.0   Suzuki         150.0   5

This is how I use linear regression with my pipeline:

from sklearn.linear_model import LinearRegression

linear_regressor = build_pipeline_with_estimator(LinearRegression())
linear_regressor.fit(X_train, y_train)
print('Linear Regression Train Performance.\n')
print(model_perf(linear_regressor, X_train, y_train))
print('Linear Regression Test Performance.\n')
print(model_perf(linear_regressor, X_test, y_test))

Now, whenever I try to fit linear regression through the pipeline, I get this error:

ValueError: could not convert string to float: 'bangalore'

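'bangalore' is one of the values in the location feature, which I am trying to one-hot encode, but it fails and I cannot figure out what is going wrong here. Any help would be appreciated.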

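After the data passes through the imputer, the columns that were not imputed are shifted to the right, as stated in the notes of the ColumnTransformer documentation: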

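Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.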
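To make the reordering concrete, here is a small pure-Python sketch of the rule quoted above. The helper name columns_after_passthrough is hypothetical (it is not part of scikit-learn), and it assumes a single transformer that emits exactly one output column per input column, which is what happens with the KNNImputer step in the question.

```python
def columns_after_passthrough(n_columns, transformed):
    """Return the original column indices in the order a ColumnTransformer
    with one transformer and remainder='passthrough' would emit them.

    Hypothetical helper for illustration only; assumes the transformer
    outputs one column per input column.
    """
    remainder = [i for i in range(n_columns) if i not in transformed]
    # Transformed columns come first, passthrough columns are appended on the right.
    return list(transformed) + remainder


# Column names taken from the dataset in the question.
names = ["kms_driven", "owner", "location", "mileage",
         "power", "brand", "engine", "age"]

order = columns_after_passthrough(len(names), [0, 3, 4, 6, 7])
new_position = {names[old]: new for new, old in enumerate(order)}

print(order)         # [0, 3, 4, 6, 7, 1, 2, 5]
print(new_position)  # engine -> 3, owner -> 5, location -> 6, brand -> 7
```

The positions printed for engine, owner, location and brand are the ones the second ColumnTransformer has to use.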

We can try running the imputer on its own first:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MinMaxScaler
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression

imputer_transformer = ColumnTransformer([
    ('knn_imputer', KNNImputer(n_neighbors=5), [0, 3, 4, 6, 7])
], remainder='passthrough')

We can try it with some example data, and you will see that your categorical columns have now moved to the right:

import numpy as np
import pandas as pd

X_train = pd.DataFrame({'kms': [0, 1, 2], 'owner': ['first', 'first', 'second'],
                        'location': ['other', 'other', 'delhi'], 'mileage': [9, 8, np.nan],
                        'power': [3, 2, 1], 'brand': ['A', 'B', 'C'],
                        'engine': [10, 100, 1000], 'age': [3, 4, 5]})
imputer_transformer.fit_transform(X_train)
Out[25]: 
array([[0.0, 9.0, 3.0, 10.0, 3.0, 'first', 'other', 'A'],
       [1.0, 8.0, 2.0, 100.0, 4.0, 'first', 'other', 'B'],
       [2.0, 8.5, 1.0, 1000.0, 5.0, 'second', 'delhi', 'C']], dtype=object)

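In your example, you can see that the engine column is now the fourth column (index 3), your ordinal column is the sixth (index 5), and the two one-hot columns are the last two (indices 6 and 7), so a simple solution might be: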

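# Indices refer to column positions in the imputer's output, not the original data.
# Note: recent scikit-learn releases only accept 'error' or 'use_encoded_value'
# for OrdinalEncoder's handle_unknown, and OneHotEncoder's sparse argument was
# renamed to sparse_output in 1.2 (removed in 1.4); adjust for your version.
category_transformer = ColumnTransformer([
    ("kms_driven_engine_min_max_scaler", MinMaxScaler(), [0, 3]),
    ("owner_ordinal_enc", OrdinalEncoder(categories=[['fourth', 'third', 'second', 'first']],
                                         handle_unknown='ignore', dtype=np.int16), [5]),
    ("brand_location_ohe", OneHotEncoder(sparse=False, handle_unknown='ignore'), [6, 7]),
], remainder='passthrough')

y_train = [7, 3, 2]
linear_regressor = build_pipeline_with_estimator(LinearRegression())
linear_regressor.fit(X_train, y_train)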
