pandas SimpleImputer: preserving data types



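I am running into a simple error in the code below.

My goal is to use SimpleImputer to impute missing values of different data types in a single pass.

When I try this, fit_transform does not seem to work as expected. Without the dtype argument the code runs fine, but the resulting DataFrame loses its dtype information. When I pass the column dtypes through the dtype argument, I get the error below. You should be able to reproduce it by copying and pasting the code locally.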

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
import sklearn
print(sklearn.__version__)
0.21.dev0
data = [['Alex','NJ',21,5.10],['Mary','NY',20,np.nan],['Sam',np.nan,np.nan,6.3]]
df = pd.DataFrame(data,columns=['Name','State','Age','Height'])
df.dtypes
Name       object
State      object
Age       float64
Height    float64
dtype: object                 
imp = SimpleImputer(strategy="most_frequent")
#df = pd.DataFrame(imp.fit_transform(df),columns=df.columns)   <<<<----- This works just fine
#df
#Name   State   Age Height
#0  Alex    NJ  21  5.1
#1  Mary    NY  20  5.1
#2  Sam NJ  20  6.3
#df.dtypes
#Name      object
#State     object
#Age       object
#Height    object
#dtype: object

The following statement fails with the error below (I am trying to preserve the data types during imputation):

df = pd.DataFrame(imp.fit_transform(df),columns=df.columns,dtype=df.dtypes)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-23-e9780979921f> in <module>()
7 
8 imp = SimpleImputer(strategy="most_frequent")
----> 9 df = pd.DataFrame(imp.fit_transform(df),columns=df.columns,dtype=df.dtypes)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
337             data = {}
338         if dtype is not None:
--> 339             dtype = self._validate_dtype(dtype)
340 
341         if isinstance(data, DataFrame):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in _validate_dtype(self, dtype)
166 
167         if dtype is not None:
--> 168             dtype = pandas_dtype(dtype)
169 
170             # a compound dtype
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\dtypes\common.py in pandas_dtype(dtype)
2020     # which we safeguard against by catching them earlier and returning
2021     # np.dtype(valid_dtype) before this condition is evaluated.
-> 2022     if dtype in [object, np.object_, 'object', 'O']:
2023         return npdtype
2024     elif npdtype.kind == 'O':
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
1574         raise ValueError("The truth value of a {0} is ambiguous. "
1575                          "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
-> 1576                          .format(self.__class__.__name__))
1577 
1578     __bool__ = __nonzero__
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

If you want to preserve the dtypes, I suggest finding the mode of each column with pandas and then calling fillna:

df = df.fillna(df.agg(lambda x: pd.Series.mode(x)[0], axis=0))
print(df)
Name State   Age  Height
0  Alex    NJ  21.0     5.1
1  Mary    NY  20.0     5.1
2   Sam    NJ  20.0     6.3
print(df.dtypes)
Name       object
State      object
Age       float64
Height    float64
dtype: object
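
The same idea can be wrapped in a small helper. The sketch below uses a hypothetical `fill_with_mode` function (not part of pandas) and assumes that a column with no non-missing values should simply be left untouched. It fills each column with its own most frequent value and leaves the dtypes unchanged.

```python
import numpy as np
import pandas as pd


def fill_with_mode(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values column by column with that column's mode.

    Hypothetical helper for illustration only. Columns that contain
    no non-missing values are left as they are.
    """
    fill_values = {}
    for name in frame.columns:
        modes = frame[name].mode(dropna=True)
        if not modes.empty:
            # mode() returns its values sorted, so ties resolve to the smallest
            fill_values[name] = modes.iloc[0]
    # A dict passed to fillna is applied per column
    return frame.fillna(value=fill_values)


data = [['Alex', 'NJ', 21, 5.10], ['Mary', 'NY', 20, np.nan], ['Sam', np.nan, np.nan, 6.3]]
df = pd.DataFrame(data, columns=['Name', 'State', 'Age', 'Height'])
filled = fill_with_mode(df)
print(filled)
print(filled.dtypes)
```

Because the values are filled in place of the missing entries rather than passed through a NumPy array, no column is converted to a common dtype along the way.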

Alternatively, use astype and pass a dictionary:

df = pd.DataFrame(
    imp.fit_transform(df), columns=df.columns
).astype(df.dtypes.to_dict())
print(df)
Name State   Age  Height
0  Alex    NJ  21.0     5.1
1  Mary    NY  20.0     5.1
2   Sam    NJ  20.0     6.3
print(df.dtypes)
Name       object
State      object
Age       float64
Height    float64
dtype: object

The explicit astype call is needed because, according to the documentation, only a single dtype can be passed to the pd.DataFrame constructor.

?pd.DataFrame
...
dtype : dtype, default None
|      Data type to force. Only a single dtype is allowed.
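
To illustrate that restriction without scikit-learn, the sketch below uses an object-typed NumPy array as a stand-in for the imputer output. That stand-in is an assumption for illustration, and `imputed` and `original_dtypes` are made-up names. The constructor applies one dtype to every column, whereas `astype` with a dictionary sets a dtype per column.

```python
import numpy as np
import pandas as pd

# Stand-in for the imputer output: a 2-D array with a single object dtype.
imputed = np.array(
    [['Alex', 'NJ', 21.0, 5.1], ['Mary', 'NY', 20.0, 5.1], ['Sam', 'NJ', 20.0, 6.3]],
    dtype=object,
)
columns = ['Name', 'State', 'Age', 'Height']
original_dtypes = {'Name': object, 'State': object, 'Age': 'float64', 'Height': 'float64'}

# The constructor accepts one dtype for the whole frame, so all columns share it.
single = pd.DataFrame(imputed, columns=columns, dtype=object)

# astype accepts a column -> dtype mapping, so each column gets its own dtype.
restored = pd.DataFrame(imputed, columns=columns).astype(original_dtypes)
print(restored.dtypes)
```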
