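I have a pandas DataFrame with several columns, 10 of which are categorical, and I want to label-encode them with LabelEncoder. However, I want to apply the same transformation to both the training set and the test set. This is what I am doing: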
categorical_columns = train.columns[:10].tolist() # List of categorical columns: [c0, c1, c2 ... c9]
le = LabelEncoder()
le.fit(categorical_columns)
train[categorical_columns] = le.transform(train[categorical_columns])
test[categorical_columns] = le.transform(test[categorical_columns])
But this code gives me the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-42-a43d4dd9a428> in <module>
4 le.fit(categorical_columns)
5
----> 6 train[categorical_columns] = le.transform(train[categorical_columns])
7 test[categorical_columns] = le.transform(test[categorical_columns])
/opt/conda/lib/python3.7/site-packages/sklearn/preprocessing/_label.py in transform(self, y)
270 """
271 check_is_fitted(self)
--> 272 y = column_or_1d(y, warn=True)
273 # transform of empty array is empty array
274 if _num_samples(y) == 0:
/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
70 FutureWarning)
71 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72 return f(**kwargs)
73 return inner_f
74
/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
845 raise ValueError(
846 "y should be a 1d array, "
--> 847 "got an array of shape {} instead.".format(shape))
848
849
ValueError: y should be a 1d array, got an array of shape (300000, 10) instead.
How do I do this correctly?
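LabelEncoder should only be used to encode the y values (the target), which is why it rejects a 2-D input. For categorical X features, use sklearn.preprocessing.OrdinalEncoder instead. For example: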
from sklearn.preprocessing import OrdinalEncoder
X = [['Male', 'X'], ['Female', 'Y'], ['Female', 'Z']]
OrdinalEncoder().fit_transform(X)
Output:
array([[1., 0.],
       [0., 1.],
       [0., 2.]])
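Applied to the question's setup, the idea is to fit the encoder on the training set only and reuse it on the test set. The following is a minimal sketch, not a definitive implementation: the small `train` and `test` frames and their column names are hypothetical stand-ins for the real data, and the `handle_unknown="use_encoded_value"` option assumes scikit-learn 0.24 or newer (the traceback above suggests an older release, where this option is not available).

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical stand-ins for the question's train/test DataFrames.
train = pd.DataFrame({"c0": ["a", "b", "a"],
                      "c1": ["x", "y", "z"],
                      "num": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"c0": ["b", "a", "c"],   # "c" never appears in train
                     "c1": ["z", "x", "y"],
                     "num": [4.0, 5.0, 6.0]})

categorical_columns = ["c0", "c1"]

# Assumption: scikit-learn >= 0.24. Categories not seen during fit are
# encoded as -1 instead of raising an error.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Learn the category-to-integer mapping from the training set only...
train[categorical_columns] = enc.fit_transform(train[categorical_columns])
# ...and reuse exactly the same mapping on the test set.
test[categorical_columns] = enc.transform(test[categorical_columns])

print(train)
print(test)
```

Because the encoder is fitted once, the same category always maps to the same integer in both sets. Without the `handle_unknown` option, `transform` raises an error when the test set contains a category that was absent from the training set.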
You can use OneHotEncoder instead of LabelEncoder:
>>> df
      Sex Cabin Embarked
0  female   C85        C
1  female  C123        S
2    male   E46        S
3  female    G6        S
4  female  C103        S
5    male   D56        S
6    male    A6        S
from sklearn.preprocessing import OneHotEncoder
arr = OneHotEncoder().fit_transform(df).toarray()
>>> arr
# Sex , Cabin , Embarked
array([[1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 1.],
       [1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1.],
       [1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1.],
       [0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 1.]])
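To keep the training and test sets consistent, as the question asks, the same fitted OneHotEncoder can be reused for both. Here is a small sketch under assumptions: the two tiny frames are made up for illustration, and `handle_unknown="ignore"` is one possible way to deal with categories that only show up in the test set.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data for illustration only.
train = pd.DataFrame({"Sex": ["female", "male", "female"],
                      "Embarked": ["C", "S", "S"]})
test = pd.DataFrame({"Sex": ["male", "female"],
                     "Embarked": ["S", "Q"]})  # "Q" never appears in train

# With handle_unknown="ignore", an unseen category is encoded as all
# zeros in that feature's block of columns instead of raising an error.
ohe = OneHotEncoder(handle_unknown="ignore")

# Fit on the training set only, then transform both sets.
train_arr = ohe.fit_transform(train).toarray()
test_arr = ohe.transform(test).toarray()

print(ohe.categories_)  # categories learned per column, in sorted order
print(train_arr)
print(test_arr)
```

Both arrays have the same number of columns because the column layout is fixed when the encoder is fitted on the training set.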
To learn more, read this post (it is a bit dated).