Python TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]



I want to one-hot encode the variables of my dataset. My code raises TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0].

Dataframe

print(df.head())
country  year     sex          age  suicides_no  population  
0  Albania  1987    male  15-24 years           21      312900   
1  Albania  1987    male  35-54 years           16      308000   
2  Albania  1987  female  15-24 years           14      289700   
3  Albania  1987    male    75+ years            1       21800   
4  Albania  1987    male  25-34 years            9      274300   
suicides/100k pop country-year  HDI for year   gdp_for_year ($)   
0               6.71  Albania1987           NaN        2.156625e+09   
1               5.19  Albania1987           NaN        2.156625e+09   
2               4.83  Albania1987           NaN        2.156625e+09   
3               4.59  Albania1987           NaN        2.156625e+09   
4               3.28  Albania1987           NaN        2.156625e+09   
gdp_per_capita ($)       generation  
0                 796     Generation X  
1                 796           Silent  
2                 796     Generation X  
3                 796  G.I. Generation  
4                 796          Boomers

from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
df['year_label'] = ohe.fit_transform(df['year'].to_numpy().reshape(-1, 1))
df['year_label'].unique()

Traceback

> ---------------------------------------------------------------------------
> TypeError                                 Traceback (most recent call last)
> /tmp/ipykernel_6768/3587352959.py in <module>
>       1 # One-hot encoding
>       2 ohe = OneHotEncoder()
> ----> 3 df['year_label'] = ohe.fit_transform(df['year'].to_numpy().reshape(-1, 1))
>       4 df['year_label'].unique()
>       5 df['sex_label'] = ohe.fit_transform(df['sex'])
> 
> ~/anaconda3/envs/tf/lib/python3.9/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
>    3610         else:
>    3611             # set column
> -> 3612             self._set_item(key, value)
>    3613 
>    3614     def _setitem_slice(self, key: slice, value):
> 
> ~/anaconda3/envs/tf/lib/python3.9/site-packages/pandas/core/frame.py in _set_item(self, key, value)
>    3782         ensure homogeneity.
>    3783         """
> -> 3784         value = self._sanitize_column(value)
>    3785 
>    3786         if (
> 
> ~/anaconda3/envs/tf/lib/python3.9/site-packages/pandas/core/frame.py in _sanitize_column(self, value)
>    4507 
>    4508         if is_list_like(value):
> -> 4509             com.require_length_match(value, self.index)
>    4510         return sanitize_array(value, self.index, copy=True, allow_2d=True)
>    4511 
> 
> ~/anaconda3/envs/tf/lib/python3.9/site-packages/pandas/core/common.py
> in require_length_match(data, index)
>     528     Check the length of data matches the length of the index.
>     529     """
> --> 530     if len(data) != len(index):
>     531         raise ValueError(
>     532             "Length of values "
> 
> ~/anaconda3/envs/tf/lib/python3.9/site-packages/scipy/sparse/base.py
> in __len__(self)
>     289     # non-zeros is more important.  For now, raise an exception!
>     290     def __len__(self):
> --> 291         raise TypeError("sparse matrix length is ambiguous; use getnnz()"
>     292                         " or shape[0]")
>     293 
> 
> TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]

There is a simple way to one-hot encode variables in pandas: pandas.get_dummies.

For example:

import pandas as pd
df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
'C': [1, 2, 3]})
pd.get_dummies(df, prefix=['col1', 'col2'])

Output:

C  col1_a  col1_b  col2_a  col2_b  col2_c
0  1       1       0       0       1       0
1  2       0       1       1       0       0
2  3       1       0       0       0       1

You can then simply merge the result with your DataFrame.

Create a simple dataframe:
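Applied to a single column such as `year`, the merge step might look like the sketch below. The small frame is made up for illustration (it is not the asker's full dataset), and `pd.concat` is just one of several ways to attach the indicator columns.

```python
import pandas as pd

# Hypothetical stand-in for the asker's dataframe.
df = pd.DataFrame({"country": ["Albania", "Albania", "Albania"],
                   "year": [1987, 1987, 1988]})

# One indicator column per distinct year; dtype=int gives 0/1 instead of booleans.
year_dummies = pd.get_dummies(df["year"], prefix="year", dtype=int)

# Both frames share the same index, so concatenating along axis=1 lines the rows up.
df_encoded = pd.concat([df, year_dummies], axis=1)
print(df_encoded)
```

`df.join(year_dummies)` should give the same result here, since the two frames share an index.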

In [20]: x = np.array([1987,1987, 1986, 1985])
In [21]: df = pd.DataFrame(x[:,None], columns=['x'])
In [22]: df
Out[22]: 
x
0  1987
1  1987
2  1986
3  1985
In [23]: one=OneHotEncoder()
In [24]: one.fit_transform(df['x'].to_numpy().reshape(-1,1))
Out[24]: 
<4x3 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>

The encoder `one` returns a scipy.sparse matrix, as documented.

Trying to assign the result to a dataframe column produces the error:
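The error message itself comes from scipy rather than from pandas or scikit-learn. A minimal sketch, building a 4x3 sparse matrix by hand (an assumption made so the example does not depend on the encoder), shows that `len()` is refused while `shape[0]` and `getnnz()` work:

```python
import numpy as np
from scipy import sparse

# Same shape and content as the encoder output above, built by hand.
dense = np.array([[0., 0., 1.],
                  [0., 0., 1.],
                  [0., 1., 0.],
                  [1., 0., 0.]])
m = sparse.csr_matrix(dense)

try:
    len(m)                      # this is what pandas calls during the assignment
    len_failed = False
except TypeError as exc:
    len_failed = True
    print("len() failed:", exc)

n_rows = m.shape[0]             # number of rows
n_stored = m.getnnz()           # number of stored (non-zero) entries
print(n_rows, n_stored)
```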

In [25]: df['new'] = one.fit_transform(df['x'].to_numpy().reshape(-1,1))
Traceback (most recent call last):
File "<ipython-input-25-b30a637ba61b>", line 1, in <module>
df['new'] = one.fit_transform(df['x'].to_numpy().reshape(-1,1))
File "/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py", line 3612, in __setitem__
self._set_item(key, value)
...

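Note the pandas `_set_item` line in the traceback: the error is raised during the assignment itself.

We can tell OneHotEncoder to return a dense numpy array (in scikit-learn 1.2 and later this parameter is named `sparse_output`):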

In [27]: one=OneHotEncoder(sparse=False)
In [28]: one.fit_transform(df['x'].to_numpy().reshape(-1,1))
Out[28]: 
array([[0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.]])

However, trying to assign it to a single dataframe column still produces an error. The array has 3 columns, one for each unique value.

In [29]: df['new'] = one.fit_transform(df['x'].to_numpy().reshape(-1,1))
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py", line 3361, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'new'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py", line 3751, in _set_item_mgr
loc = self._info_axis.get_loc(key)
...
ValueError: Wrong number of items passed 3, placement implies 1

But it does work if I convert the array to a list of lists. It now puts a list in each cell of the `new` column:

In [41]: df['new'] = one.fit_transform(df['x'].to_numpy().reshape(-1,1)).tolist()
In [42]: df
Out[42]: 
x              new
0  1987  [0.0, 0.0, 1.0]
1  1987  [0.0, 0.0, 1.0]
2  1986  [0.0, 1.0, 0.0]
3  1985  [1.0, 0.0, 0.0]

There may be a pandas method to split these lists into separate columns, but I am not a pandas expert.
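One possible way to get separate columns, sketched here as a suggestion rather than as the answerer's method, is to skip the list step and wrap the encoded array in its own DataFrame. The column names (`x_1985` and so on) are an assumed naming scheme built from the encoder's `categories_` attribute.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"x": [1987, 1987, 1986, 1985]})

one = OneHotEncoder()
encoded = one.fit_transform(df["x"].to_numpy().reshape(-1, 1))

# The default output is a scipy.sparse matrix; toarray() makes it a dense 2-D array.
dense = encoded.toarray()

# One column name per category the encoder found (naming scheme is illustrative).
cols = [f"x_{c}" for c in one.categories_[0]]

# Reuse df's index so the rows line up, then attach the new columns.
onehot = pd.DataFrame(dense, columns=cols, index=df.index)
df = df.join(onehot)
print(df)
```

Each unique value now has its own 0/1 column, so nothing has to be squeezed into a single column.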
