Label encoding multiple columns with the same transformation on the training and test sets



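I have a pandas DataFrame with a number of columns, 10 of which are categorical, and I want to label-encode them with LabelEncoder. However, I want the same transformation applied to both the training set and the test set. This is what I tried: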

categorical_columns = train.columns[:10].tolist()       # List of categorical columns: [c0, c1, c2 ... c9]
le = LabelEncoder()
le.fit(categorical_columns)
train[categorical_columns] = le.transform(train[categorical_columns])
test[categorical_columns] = le.transform(test[categorical_columns])

But this code gives me the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-42-a43d4dd9a428> in <module>
4 le.fit(categorical_columns)
5 
----> 6 train[categorical_columns] = le.transform(train[categorical_columns])
7 test[categorical_columns] = le.transform(test[categorical_columns])
/opt/conda/lib/python3.7/site-packages/sklearn/preprocessing/_label.py in transform(self, y)
270         """
271         check_is_fitted(self)
--> 272         y = column_or_1d(y, warn=True)
273         # transform of empty array is empty array
274         if _num_samples(y) == 0:
/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
70                           FutureWarning)
71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
73     return inner_f
74 
/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
845     raise ValueError(
846         "y should be a 1d array, "
--> 847         "got an array of shape {} instead.".format(shape))
848 
849 
ValueError: y should be a 1d array, got an array of shape (300000, 10) instead.

How do I do this correctly?

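LabelEncoder is meant only for encoding the target values y, which is why its transform expects a 1-D array and rejects a 10-column DataFrame.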
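For contrast, here is a minimal sketch of the case LabelEncoder is designed for: a single 1-D array of target labels, fitted on the training labels and reused on the test labels. The label values are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical 1-D target labels (not from the question).
y_train = ["cat", "dog", "cat", "bird"]
y_test = ["dog", "bird"]

le = LabelEncoder()
le.fit(y_train)                      # learns the sorted classes: bird, cat, dog

y_train_enc = le.transform(y_train)  # same mapping ...
y_test_enc = le.transform(y_test)    # ... reused on the test labels
```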

For categorical X features, you should use sklearn.preprocessing.OrdinalEncoder instead.

For example:

from sklearn.preprocessing import OrdinalEncoder
X = [['Male', 'X'], ['Female', 'Y'], ['Female', 'Z']]
OrdinalEncoder().fit_transform(X)

Output:

array([[1., 0.],
       [0., 1.],
       [0., 2.]])
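Applied to the situation in the question, a minimal sketch might look like the following: fit the encoder once on the training columns, then call transform on both frames so they share one mapping. The toy frames and the column names c0, c1 and num are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data standing in for the asker's train/test frames.
train = pd.DataFrame({"c0": ["a", "b", "a"], "c1": ["x", "y", "z"], "num": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"c0": ["b", "a"], "c1": ["z", "x"], "num": [4.0, 5.0]})

categorical_columns = ["c0", "c1"]

# Fit on the training data only; the learned categories define the mapping.
enc = OrdinalEncoder()
enc.fit(train[categorical_columns])

# Reuse the fitted encoder on both sets so the codes are consistent.
train[categorical_columns] = enc.transform(train[categorical_columns])
test[categorical_columns] = enc.transform(test[categorical_columns])
```

Note that, by default, transform raises a ValueError if the test set contains a category that was not seen during fit. In scikit-learn 0.24 and later, OrdinalEncoder accepts handle_unknown="use_encoded_value" together with an unknown_value to map such categories to a fixed code.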

You can use OneHotEncoder instead of LabelEncoder:

>>> df
      Sex Cabin Embarked
0  female   C85        C
1  female  C123        S
2    male   E46        S
3  female    G6        S
4  female  C103        S
5    male   D56        S
6    male    A6        S
>>> from sklearn.preprocessing import OneHotEncoder
>>> arr = OneHotEncoder().fit_transform(df).toarray()
>>> arr
#       Sex   , Cabin                         , Embarked
array([[1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 1.],
       [1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1.],
       [1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1.],
       [0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 1.]])
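To get the same one-hot layout on a training and a test set, one possible approach is to fit the encoder on the training frame and only call transform on the test frame. The sketch below uses small made-up frames, and handle_unknown="ignore" is an optional choice that encodes categories unseen during fit as all zeros rather than raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frames; "Q" appears only in the test set.
train = pd.DataFrame({"Sex": ["female", "male", "female"], "Embarked": ["C", "S", "S"]})
test = pd.DataFrame({"Sex": ["male", "female"], "Embarked": ["S", "Q"]})

# Unseen categories at transform time become an all-zero block for that column.
enc = OneHotEncoder(handle_unknown="ignore")

train_arr = enc.fit_transform(train).toarray()  # learns columns: female, male | C, S
test_arr = enc.transform(test).toarray()        # same column layout as train_arr
```

Because both arrays come from the same fitted encoder, they have the same number of columns in the same order, which is what a downstream model expects.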

To learn more, read this article (it is a bit dated).
