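I am trying to encode ordinal categorical values in the third column of my dataset, where "Tiny Mongra" has the lowest value and "1st Wand" has the highest. The names are synonyms for small, medium and large sizes; in this dataset they describe the size of a grain of rice.
When I run this snippet, I keep getting the following error: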
Traceback (most recent call last):
File "<ipython-input-1-ae4501cc0ac1>", line 19, in <module>
X[:, 2] = ordinalencoder_X_3.fit_transform(X[:, 2])
File "/Users/anhad/anaconda3/lib/python3.6/site-packages/sklearn/base.py", line 462, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/Users/anhad/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 794, in fit
self._fit(X)
File "/Users/anhad/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 61, in _fit
X = self._check_X(X)
File "/Users/anhad/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 47, in _check_X
X_temp = check_array(X, dtype=None)
File "/Users/anhad/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 552, in check_array
"if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=['1st Wand' '1st Wand' '1st Wand' ... '1st Wand' '1st Wand' '1st Wand'].
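On closer inspection, I found that the error is not complaining about my list of categories but about the column I want to encode. For some reason it treats that column as a 1D array of the form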
array=['1st Wand' '1st Wand' '1st Wand' '1st Wand' '1st Wand' 'Dubar' '2nd Wand'
'Tibar' 'Mongra' '1st Wand' '1st Wand' '1st Wand' '1st Wand' '1st Wand'
'1st Wand' '2nd Wand' 'Super Dubar' 'Super Tibar' ... '1st Wand' '1st Wand'].
This is strange, because I used LabelEncoder on the other categorical columns in the dataset and it worked fine.
Here is a link to the data. See the "Data" sheet:
https://docs.google.com/spreadsheets/d/12nAU5QztVnVroRYDsRDsZGUyBpBTwAD5yMmbMaAxnHQ/edit?usp=sharing
Here is the full code. See the last part:
import numpy as np
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Ryze Price NN Data.csv')
X = dataset.iloc[:, 1:7].values
y = dataset.iloc[:, 7].values
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
labelencoder_X_1 = LabelEncoder()
X[:, 0] = labelencoder_X_1.fit_transform(X[:, 0])
labelencoder_X_2 = LabelEncoder()
X[:, 1] = labelencoder_X_2.fit_transform(X[:, 1])
# SEE THIS PART
category_array = ["Tiny Mongra","Mini Mongra","Mongra","Super Mongra","Mini Dubar","Dubar","Super Dubar","Mini Tibar","Tibar","Super Tibar","2nd Wand","Super 2nd Wand","1st Wand"]
ordinalencoder_X_3 = OrdinalEncoder(categories=category_array)
X[:, 2] = ordinalencoder_X_3.fit_transform(np.array(X[:, 2]))
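I expect the categorical data to be encoded as follows: "Tiny Mongra" should be encoded as 0, and so on up to "1st Wand", which should be encoded as 12.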
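The main difference between LabelEncoder and OrdinalEncoder is what they are meant for:
- LabelEncoder should be used for the target variable
- OrdinalEncoder should be used for feature variables
In general they work the same way, but:
- LabelEncoder expects y: array-like of shape [n_samples]
- OrdinalEncoder expects X: array-like of shape [n_samples, n_features]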
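The shape difference described above is what triggers the "Expected 2D array, got 1D array" error in the question. A minimal sketch, using a short made-up column (the values in `column` are illustrative, not taken from the real dataset):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Illustrative 1D column of size names (not the real dataset).
column = np.array(["Dubar", "1st Wand", "Tibar", "Dubar"], dtype=object)

# LabelEncoder accepts a 1D array of shape (n_samples,).
label_codes = LabelEncoder().fit_transform(column)

# OrdinalEncoder needs a 2D array of shape (n_samples, n_features),
# so a single column has to be reshaped to (n_samples, 1).
ordinal_codes = OrdinalEncoder().fit_transform(column.reshape(-1, 1))

# Passing the 1D column directly raises the ValueError from the question.
try:
    OrdinalEncoder().fit_transform(column)
    raised_for_1d = False
except ValueError:
    raised_for_1d = True

print(label_codes.shape, ordinal_codes.shape, raised_for_1d)
```

With a NumPy array, `X[:, 2]` is 1D while `X[:, [2]]` or `X[:, 2:3]` keeps the column 2D, so either of the latter can be passed to OrdinalEncoder without reshaping.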
If you only want to encode the values of a categorical variable as 0, 1, ..., n, use LabelEncoder the same way you did for X1 and X2.
labelencoder_X_3 = LabelEncoder()
X[:, 2] = labelencoder_X_3.fit_transform(X[:, 2])
But I would transform all three variables at once with OrdinalEncoder:
ordinalencoder_X = OrdinalEncoder()
X[:, 0:3] = ordinalencoder_X.fit_transform(X[:, 0:3])
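Note that with default settings both encoders assign codes in sorted (alphabetical) order, so "1st Wand" would get 0 rather than 12. To get the order asked for in the question, one option is to pass the category list to OrdinalEncoder. The sketch below assumes the `categories` parameter is given as a list with one inner list per column, and uses a short made-up `sizes` array in place of the real column:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Desired order from the question: smallest grain first, largest last.
category_order = [
    "Tiny Mongra", "Mini Mongra", "Mongra", "Super Mongra",
    "Mini Dubar", "Dubar", "Super Dubar",
    "Mini Tibar", "Tibar", "Super Tibar",
    "2nd Wand", "Super 2nd Wand", "1st Wand",
]

# Illustrative stand-in for X[:, 2] (not the real data).
sizes = np.array(["1st Wand", "Dubar", "Tiny Mongra", "2nd Wand"], dtype=object)

# categories takes one list per feature column, hence the outer brackets.
encoder = OrdinalEncoder(categories=[category_order])

# The single column is reshaped to 2D before fitting.
encoded = encoder.fit_transform(sizes.reshape(-1, 1))

# OrdinalEncoder returns floats by default; convert for readability.
codes = encoded.ravel().astype(int).tolist()
print(codes)  # [12, 5, 0, 10]
```

Applied to the question's array, the same idea would read `X[:, 2] = encoder.fit_transform(X[:, [2]]).ravel()`, assuming every value in that column appears in `category_order`.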
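Another option, instead of OrdinalEncoder, is to use the pandas applymap function with a lambda that looks each value up in a mapping dictionary.
Here is the mapping dictionary: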
mapping = {"Tiny Mongra": 0, "Mini Mongra": 1, "Mongra": 2, "Super Mongra": 3,
           "Mini Dubar": 4, "Dubar": 5, "Super Dubar": 6, "Mini Tibar": 7,
           "Tibar": 8, "Super Tibar": 9, "2nd Wand": 10, "Super 2nd Wand": 11,
           "1st Wand": 12}
Here is my DataFrame:
df = pd.DataFrame(['Tiny Mongra', 'Mini Dubar', 'Mongra', '1st Wand', '1st Wand',
                   'Dubar', '2nd Wand', 'Tibar', 'Mongra', 'Super Dubar',
                   '1st Wand', '1st Wand', '1st Wand', '1st Wand', '1st Wand',
                   '2nd Wand', 'Super Dubar', 'Super Tibar', '1st Wand', '1st Wand'],
                  columns=['category'])
You can then create another column holding the encoded values with the following code:
df['mapped_category'] = df.applymap(lambda x : mapping[x])
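As a variation on the same idea, the lookup can be done on the single column with `Series.map`, which accepts the dictionary directly. This is a sketch rather than the answer's own method. It may be preferable on recent pandas versions, where, as far as I know, `DataFrame.applymap` is deprecated in favor of `DataFrame.map`. The three-row `df` here is made up for illustration:

```python
import pandas as pd

mapping = {"Tiny Mongra": 0, "Mini Mongra": 1, "Mongra": 2, "Super Mongra": 3,
           "Mini Dubar": 4, "Dubar": 5, "Super Dubar": 6, "Mini Tibar": 7,
           "Tibar": 8, "Super Tibar": 9, "2nd Wand": 10, "Super 2nd Wand": 11,
           "1st Wand": 12}

# Small illustrative frame (not the full dataset).
df = pd.DataFrame({"category": ["Tiny Mongra", "Mini Dubar", "1st Wand"]})

# Series.map looks up each value in the dictionary;
# values missing from the dictionary would become NaN.
df["mapped_category"] = df["category"].map(mapping)

mapped = df["mapped_category"].tolist()
print(mapped)  # [0, 4, 12]
```

Because it works on one column, this version can be rerun after `mapped_category` has been added, whereas applying a function to the whole DataFrame would then also try to look up the integer codes.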
Try the following:
ordinal_encoder_X = OrdinalEncoder()
X[:, 0:3] = ordinal_encoder_X.fit_transform(X[:, 0:3])
Try the following, which passes the three columns as a DataFrame:
ordinal_encoder_X = OrdinalEncoder()
X[:, 0:3] = ordinal_encoder_X.fit_transform(pd.DataFrame(X[:, 0:3]))