LabelBinarizer behaves inconsistently because of NaN values

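I am trying to convert a DataFrame column of text into a one-hot encoded matrix. This worked fine for a while but has stopped working for reasons I don't understand. The message reads: "TypeError: '>' not supported between instances of 'str' and 'float'". That seems like nonsense to me, because I only use text data. When I repeat the experiment with a small dataset, LabelBinarizer works fine and produces the desired output.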

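I noticed that the X_train DataFrame is 4.6 GB in size, and my machine has only 8 GB. Should I be aware of memory limits? All the numbers are small; should I convert them to int32 and float32?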

I can reproduce the error below, but I'm not sure whether it provides enough information.

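import pandas as pd
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
s = ['a', 'b', 'c', 'b', 'a']
df = pd.DataFrame(s)  # immediately overwritten by the Series on the next line
df = pd.Series(s)     # small test Series, which works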
dd = X_train['state']
type(dd)
Out[9]: pandas.core.series.Series
type(df)
Out[10]: pandas.core.series.Series
lb.fit(dd)
Traceback (most recent call last):
  File "<ipython-input-11-5ec245111e31>", line 1, in <module>
    lb.fit(dd)
  File "C:\packages\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py", line 296, in fit
    self.y_type_ = type_of_target(y)
  File "C:\packages\Anaconda3\lib\site-packages\sklearn\utils\multiclass.py", line 275, in type_of_target
    if (len(np.unique(y)) > 2) or (y.ndim >= 2 and len(y[0]) > 1):
  File "C:\packages\Anaconda3\lib\site-packages\numpy\lib\arraysetops.py", line 214, in unique
    ar.sort()
TypeError: '>' not supported between instances of 'str' and 'float'

lb.fit(df)
Out[12]: LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)
df.value_counts()
Out[13]: 
a    2
b    2
c    1
dtype: int64
dd.value_counts()
Out[14]: 
MI    228601
CA      5020
TX      2420
FL      2237
IL      1310
SC      1304
OH       967
NY       673
MN       632
GA       535
NV       484
UT       477
PA       466
NJ       395
VA       385
NC       353
MD       349
AZ       329
ME       261
OK       248
AL       215
TN       207
WA       192
MA       182
IA       159
WI       159
OR       153
MO       151
CO       147
KY       146
IN       106
AR        82
LA        81
AK        79
UK        77
NB        77
MS        64
CT        60
DC        58
ON        51
DE        50
KS        37
RI        35
SD        33
ID        33
MT        28
NM        21
BC        17
WY        12
HI        10
NH         9
VT         7
VI         6
WV         6
PR         5
QC         5
QL         3
ND         2
BL         2
Name: state, dtype: int64
len(df)
Out[15]: 5
len(dd)
Out[16]: 250306

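The input data probably contains missing values. NaN is a float, so when np.unique sorts the column it ends up comparing a str with a float, which raises this TypeError. Note that value_counts() skips NaN by default, so missing values do not show up in its output.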

from sklearn.preprocessing import LabelBinarizer
import numpy as np
import pandas as pd
lb = LabelBinarizer()
s = ['a','b','c','b','a', np.nan]
df = pd.DataFrame(s, columns=["state"])
df_binarized = lb.fit_transform(df['state'])
df_binarized
Traceback (most recent call last):
  File "/home/kuroyanagi/.pyenv/versions/anaconda3-4.4.0/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-45-f16e01b4e1be>", line 4, in <module>
    df_binarized = lb.fit_transform(df['state'])
  File "/home/kuroyanagi/.pyenv/versions/anaconda3-4.4.0/lib/python3.6/site-packages/sklearn/base.py", line 494, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/home/kuroyanagi/.pyenv/versions/anaconda3-4.4.0/lib/python3.6/site-packages/sklearn/preprocessing/label.py", line 296, in fit
    self.y_type_ = type_of_target(y)
  File "/home/kuroyanagi/.pyenv/versions/anaconda3-4.4.0/lib/python3.6/site-packages/sklearn/utils/multiclass.py", line 275, in type_of_target
    if (len(np.unique(y)) > 2) or (y.ndim >= 2 and len(y[0]) > 1):
  File "/home/kuroyanagi/.pyenv/versions/anaconda3-4.4.0/lib/python3.6/site-packages/numpy/lib/arraysetops.py", line 210, in unique
    return _unique1d(ar, return_index, return_inverse, return_counts)
  File "/home/kuroyanagi/.pyenv/versions/anaconda3-4.4.0/lib/python3.6/site-packages/numpy/lib/arraysetops.py", line 277, in _unique1d
    ar.sort()
TypeError: '<' not supported between instances of 'float' and 'str'
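One way to confirm this on a real column is to count its missing entries and the Python types it holds. The following is only a minimal sketch: the small `dd` Series is a made-up stand-in for `X_train['state']`, and it assumes the column has object dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X_train['state']; not the asker's real data.
dd = pd.Series(['MI', 'CA', np.nan, 'TX', 'MI'], dtype=object)

# Number of missing entries in the column.
n_missing = int(dd.isna().sum())

# Which Python types are actually stored; NaN shows up as 'float'.
type_counts = dd.map(lambda v: type(v).__name__).value_counts()

# value_counts() drops NaN by default, so its total can be smaller than len(dd).
n_counted = int(dd.value_counts().sum())

print(n_missing, n_counted, len(dd))
print(type_counts)
```

If `n_counted` is smaller than `len(dd)`, the difference is the number of missing values that `value_counts()` silently left out.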

Without missing values, it works as follows.

from sklearn.preprocessing import LabelBinarizer
import numpy as np
import pandas as pd
lb = LabelBinarizer()
s = ['a','b','c','b','a']
df = pd.DataFrame(s, columns=["state"])
df_binarized = lb.fit_transform(df['state'])
df_binarized
Out[46]:
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [0, 1, 0],
       [1, 0, 0]])
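If the column does contain NaN, two common workarounds are to replace the missing entries with a placeholder label or to drop them before fitting. This is a rough sketch rather than a recommendation for the asker's data: the placeholder string "missing" and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Same toy data as above, with one missing value.
state = pd.Series(['a', 'b', 'c', 'b', 'a', np.nan], dtype=object, name="state")

# Option 1: treat NaN as its own category ("missing" is an arbitrary placeholder).
filled = state.fillna("missing")
lb_filled = LabelBinarizer()
onehot_filled = lb_filled.fit_transform(filled)

# Option 2: drop the rows with missing values before fitting.
kept = state.dropna()
lb_kept = LabelBinarizer()
onehot_kept = lb_kept.fit_transform(kept)

print(lb_filled.classes_, onehot_filled.shape)
print(lb_kept.classes_, onehot_kept.shape)
```

Option 1 keeps the row count unchanged, which matters if the result has to line up with the other columns of X_train. Option 2 shortens the data, so the same rows would have to be dropped everywhere else.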
