I have some pandas data with text-type columns, and alongside those there are some NaN values. What I want to do is impute those NaNs with sklearn.preprocessing.Imputer (replacing NaN by the most frequent value). The problem is in the execution. Suppose there is a pandas DataFrame df with 30 columns, 10 of which are of categorical nature. Once I run:
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
imp.fit(df)
Python generates an error: 'could not convert string to float: run1', where 'run1' is an ordinary (non-missing) value from the first column with categorical data.
Any help would be very welcome!
To use mean values for numeric columns and the most frequent value for non-numeric columns you could do something like this. You could further distinguish between integers and floats; I guess it might make sense to use the median for integer columns instead.
import pandas as pd
import numpy as np
from sklearn.base import TransformerMixin
class DataFrameImputer(TransformerMixin):

    def __init__(self):
        """Impute missing values.

        Columns of dtype object are imputed with the most frequent value
        in column.

        Columns of other types are imputed with mean of column.
        """
    def fit(self, X, y=None):
        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
            index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)
data = [
    ['a', 1, 2],
    ['b', 1, 1],
    ['b', 2, 2],
    [np.nan, np.nan, np.nan]
]
X = pd.DataFrame(data)
xt = DataFrameImputer().fit_transform(X)
print('before...')
print(X)
print('after...')
print(xt)
which prints:
before...
     0    1    2
0    a    1    2
1    b    1    1
2    b    2    2
3  NaN  NaN  NaN
after...
   0         1         2
0  a  1.000000  2.000000
1  b  1.000000  1.000000
2  b  2.000000  2.000000
3  b  1.333333  1.666667
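The core of the class above is the per-column fill Series: the most frequent value (`value_counts().index[0]`) for object columns, the mean otherwise, handed to `DataFrame.fillna`, which accepts per-column fill values. A minimal sketch of that idiom on its own, with made-up column names (`run`, `val`):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'run': ['run1', 'run1', 'run2', np.nan],
                   'val': [1.0, 2.0, np.nan, 4.0]})

# per-column fill value: most frequent for object columns, mean otherwise
fill = pd.Series([df[c].value_counts().index[0]
                  if df[c].dtype == np.dtype('O') else df[c].mean()
                  for c in df], index=df.columns)
out = df.fillna(fill)
# 'run' NaN -> 'run1' (most frequent), 'val' NaN -> mean of 1, 2 and 4
```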
You can use sklearn_pandas.CategoricalImputer for the categorical columns. Details:

First, (from the book Hands-On Machine Learning with Scikit-Learn and TensorFlow) you can have sub-pipelines for numerical and string/categorical features, where each sub-pipeline's first transformer is a selector that takes a list of column names (and full_pipeline.fit_transform() takes a pandas DataFrame):
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
You can then combine these sub-pipelines with sklearn.pipeline.FeatureUnion, for example:
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline)
])
Now, in the num_pipeline you can simply use sklearn.preprocessing.Imputer(), but in the cat_pipeline you can use CategoricalImputer() from the sklearn_pandas package.

Note: the sklearn-pandas package can be installed with pip install sklearn-pandas, but it is imported as import sklearn_pandas.
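For completeness, here is a runnable sketch of the two sub-pipelines combined with FeatureUnion. Since Imputer has since been removed from scikit-learn, this uses sklearn.impute.SimpleImputer in both branches (strategy='most_frequent' standing in for CategoricalImputer); the column names num and cat are made up:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline

# selector as in the answer above: picks named columns out of a DataFrame
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

df = pd.DataFrame({'num': [1.0, np.nan, 3.0],
                   'cat': ['a', 'a', np.nan]})

num_pipeline = Pipeline([
    ('select', DataFrameSelector(['num'])),
    ('impute', SimpleImputer(strategy='mean')),
])
cat_pipeline = Pipeline([
    ('select', DataFrameSelector(['cat'])),
    ('impute', SimpleImputer(strategy='most_frequent')),
])
full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline),
])
out = full_pipeline.fit_transform(df)  # numeric column first, then categorical
```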
There is a package sklearn-pandas which has an imputation option for categorical variables: https://github.com/scikit-learn-contrib/sklearn-pandas (see CategoricalImputer).
>>> from sklearn_pandas import CategoricalImputer
>>> data = np.array(['a', 'b', 'b', np.nan], dtype=object)
>>> imputer = CategoricalImputer()
>>> imputer.fit_transform(data)
array(['a', 'b', 'b', 'b'], dtype=object)
strategy = 'most_frequent' can be used only with quantitative features, not with qualitative ones. This custom imputer can be used for both qualitative and quantitative features. Also, with the scikit-learn Imputer either we can use it for the whole data frame (if all features are quantitative), or we can use a 'for loop' over a list of similar-type features/columns (see the example below). But the custom imputer can be used with any combination.
from sklearn.preprocessing import Imputer
impute = Imputer(strategy='mean')
for cols in ['quantitative_column', 'quant']:  # here both are quantitative features
    xx[cols] = impute.fit_transform(xx[[cols]])
Custom imputer:

import numpy as np
from sklearn.preprocessing import Imputer
from sklearn.base import TransformerMixin

class CustomImputer(TransformerMixin):
    def __init__(self, cols=None, strategy='mean'):
        self.cols = cols
        self.strategy = strategy

    def transform(self, df):
        X = df.copy()
        impute = Imputer(strategy=self.strategy)
        if self.cols == None:
            self.cols = list(X.columns)
        for col in self.cols:
            if X[col].dtype == np.dtype('O'):
                X[col].fillna(X[col].value_counts().index[0], inplace=True)
            else:
                X[col] = impute.fit_transform(X[[col]])
        return X

    def fit(self, *_):
        return self
DataFrame:

X = pd.DataFrame({'city': ['tokyo', np.NaN, 'london', 'seattle', 'san francisco', 'tokyo'],
                  'boolean': ['yes', 'no', np.NaN, 'no', 'no', 'yes'],
                  'ordinal_column': ['somewhat like', 'like', 'somewhat like', 'like', 'somewhat like', 'dislike'],
                  'quantitative_column': [1, 11, -.5, 10, np.NaN, 20]})

            city boolean ordinal_column  quantitative_column
0          tokyo     yes  somewhat like                  1.0
1            NaN      no           like                 11.0
2         london     NaN  somewhat like                 -0.5
3        seattle      no           like                 10.0
4  san francisco      no  somewhat like                  NaN
5          tokyo     yes        dislike                 20.0
1) Can be used with a list of similar-type features.

cci = CustomImputer(cols=['city', 'boolean'])  # here default strategy = mean
cci.fit_transform(X)
2) Can be used with strategy = 'median':

sd = CustomImputer(['quantitative_column'], strategy='median')
sd.fit_transform(X)
3) Can be used with the whole data frame; it will use the default mean (or we can also change it to median). For qualitative features it uses strategy = 'most_frequent', and for quantitative ones the mean/median.

call = CustomImputer()
call.fit_transform(X)
I copied and modified sveitser's answer and made an imputer for a pandas.Series object:
import numpy
import pandas
from sklearn.base import TransformerMixin
class SeriesImputer(TransformerMixin):

    def __init__(self):
        """Impute missing values.

        If the Series is of dtype Object, then impute with the most frequent object.
        If the Series is not of dtype Object, then impute with the mean.
        """
    def fit(self, X, y=None):
        if X.dtype == numpy.dtype('O'):
            self.fill = X.value_counts().index[0]
        else:
            self.fill = X.mean()
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)
To use it you would do:
# Make a series
s1 = pandas.Series(['k', 'i', 't', 't', 'e', numpy.NaN])
a = SeriesImputer() # Initialize the imputer
a.fit(s1) # Fit the imputer
s2 = a.transform(s1) # Get a new series
Inspired by the answers here, and wanting a go-to Imputer for all use-cases, I ended up writing this. It supports four strategies for imputation: mean, mode, median, fill, and works on both pd.DataFrame and pd.Series. mean and median work only for numeric data; mode and fill work for both numeric and categorical data.
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class CustomImputer(BaseEstimator, TransformerMixin):
    def __init__(self, strategy='mean', filler='NA'):
        self.strategy = strategy
        self.fill = filler

    def fit(self, X, y=None):
        if self.strategy in ['mean', 'median']:
            if not all(X.dtypes == np.number):
                raise ValueError('dtypes mismatch: np.number dtype is '
                                 'required for ' + self.strategy)
        if self.strategy == 'mean':
            self.fill = X.mean()
        elif self.strategy == 'median':
            self.fill = X.median()
        elif self.strategy == 'mode':
            self.fill = X.mode().iloc[0]
        elif self.strategy == 'fill':
            if type(self.fill) is list and type(X) is pd.DataFrame:
                self.fill = dict([(cname, v) for cname, v in zip(X.columns, self.fill)])
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)
Usage:

>> df
     MasVnrArea FireplaceQu
Id
1         196.0         NaN
974       196.0         NaN
21        380.0          Gd
5         350.0          TA
651         NaN          Gd

>> CustomImputer(strategy='mode').fit_transform(df)
     MasVnrArea FireplaceQu
Id
1         196.0          Gd
974       196.0          Gd
21        380.0          Gd
5         350.0          TA
651       196.0          Gd

>> CustomImputer(strategy='fill', filler=[0, 'NA']).fit_transform(df)
     MasVnrArea FireplaceQu
Id
1         196.0          NA
974       196.0          NA
21        380.0          Gd
5         350.0          TA
651         0.0          Gd
This code fills a series in with the most frequent category:
import pandas as pd
import numpy as np
# create fake data
m = pd.Series(list('abca'))
m.iloc[1] = np.nan #artificially introduce nan
print('m = ')
print(m)
#make dummy variables, count and sort descending:
most_common = pd.get_dummies(m).sum().sort_values(ascending=False).index[0]
def replace_most_common(x):
    if pd.isnull(x):
        return most_common
    else:
        return x
new_m = m.map(replace_most_common) #apply function to original data
print('new_m = ')
print(new_m)
Output:
m =
0 a
1 NaN
2 c
3 a
dtype: object
new_m =
0 a
1 a
2 c
3 a
dtype: object
Using sklearn.impute.SimpleImputer instead of Imputer easily solves this, and it can handle categorical variables.

Per the sklearn documentation: if "most_frequent", then missing values are replaced using the most frequent value along each column. Can be used with strings or numeric data.

https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

impute_size = SimpleImputer(strategy="most_frequent")
data[['Outlet_Size']] = impute_size.fit_transform(data[['Outlet_Size']])
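A self-contained sketch of the same approach, with a made-up data frame standing in for `data`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# hypothetical data frame; 'Medium' is the most frequent value
data = pd.DataFrame({'Outlet_Size': ['Medium', 'Medium', 'Small', np.nan]})

impute_size = SimpleImputer(strategy='most_frequent')
data[['Outlet_Size']] = impute_size.fit_transform(data[['Outlet_Size']])
```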
MissForest can be used for the imputation of missing values in a categorical variable along with other categorical features. It works in an iterative way, similar to IterativeImputer, with random forest as the base model.

Below is the code to label-encode the features along with the target variable, fit a model to impute the NaN values, and encode the features back:
import sklearn.neighbors._base
from sklearn.preprocessing import LabelEncoder
import sys
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base
from missingpy import MissForest
def label_encoding(df, columns):
    """
    Label encodes the set of the features to be used for imputation
    Args:
        df: data frame (processed data)
        columns: list (features to be encoded)
    Returns: dictionary
    """
    encoders = dict()
    for col_name in columns:
        series = df[col_name]
        label_encoder = LabelEncoder()
        df[col_name] = pd.Series(
            label_encoder.fit_transform(series[series.notnull()]),
            index=series[series.notnull()].index
        )
        encoders[col_name] = label_encoder
    return encoders
# adding to be imputed global category along with features
features = ['feature_1', 'feature_2', 'target_variable']
# label encoding features
encoders = label_encoding(data, features)
# categorical imputation using random forest
# parameters can be tuned accordingly
imp_cat = MissForest(n_estimators=50, max_depth=80)
data[features] = imp_cat.fit_transform(data[features], cat_vars=[0, 1, 2])
# decoding features
for variable in features:
    data[variable] = encoders[variable].inverse_transform(data[variable].astype(int))
Similar. Modify Imputer for strategy='most_frequent':
class GeneralImputer(Imputer):
    def __init__(self, **kwargs):
        Imputer.__init__(self, **kwargs)

    def fit(self, X, y=None):
        if self.strategy == 'most_frequent':
            self.fills = pd.DataFrame(X).mode(axis=0).squeeze()
            self.statistics_ = self.fills.values
            return self
        else:
            return Imputer.fit(self, X, y=y)

    def transform(self, X):
        if hasattr(self, 'fills'):
            return pd.DataFrame(X).fillna(self.fills).values.astype(str)
        else:
            return Imputer.transform(self, X)
where pandas.DataFrame.mode() finds the most frequent value per column, and then pandas.DataFrame.fillna() fills the missing values with them. Other strategy values are still handled the same way by Imputer.
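Since Imputer is gone from recent scikit-learn releases, the mode-then-fillna idea from this answer can also be sketched in plain pandas (made-up two-column frame for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['x', 'y', 'y', np.nan],
                   'b': [1.0, np.nan, 1.0, 3.0]})

# mode(axis=0) keeps each column's most frequent value in row 0
fills = df.mode(axis=0).iloc[0]
out = df.fillna(fills)
# 'a' NaN -> 'y', 'b' NaN -> 1.0
```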
You could try the following:

replace = df.<yourcolumn>.value_counts().idxmax()
df['<yourcolumn>'].fillna(replace, inplace=True)
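A concrete sketch with a hypothetical column name `grade` (`value_counts().idxmax()` returns the most frequent value itself, not its count):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'grade': ['A', 'B', 'B', np.nan]})

replace = df['grade'].value_counts().idxmax()  # most frequent value: 'B'
df['grade'] = df['grade'].fillna(replace)
```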
Here is my attempt at multiple imputation, based on @Gautham Kumaran's idea. It uses mode ('most_frequent') for categorical variable replacement and then multiply imputes the numeric variables via regression:
# mising values imputation
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.base import BaseEstimator, TransformerMixin
# class for missing data imputation
# =============================================================
class MVImputer(BaseEstimator, TransformerMixin):
    def __init__(self, random_state=None, filler='NA'):
        self.random_state = random_state
        self.fill = filler

    def fit(self, X, y=None):
        categorical_dtypes = ['object', 'category', 'bool']
        # split columns by type: categoricals get the most frequent value,
        # numericals get a regression-based IterativeImputer
        self.cat_cols = [col for col in X.columns
                         if X[col].dtype.name in categorical_dtypes]
        self.num_cols = [col for col in X.columns if col not in self.cat_cols]
        if self.cat_cols:
            self.fill = X[self.cat_cols].mode().iloc[0]
        if self.num_cols:
            num = X[self.num_cols]
            # sample_posterior=True so different seeds yield different imputations
            self.imputer = IterativeImputer(max_iter=10,
                                            sample_posterior=True,
                                            random_state=self.random_state,
                                            min_value=num.min(axis=0).values,
                                            max_value=num.max(axis=0).values)
            self.imputer.fit(num)
        return self

    def transform(self, X, y=None):
        X = X.copy()
        if self.cat_cols:
            X[self.cat_cols] = X[self.cat_cols].fillna(self.fill)
        if self.num_cols:
            X[self.num_cols] = self.imputer.transform(X[self.num_cols])
        return X

# call for single imputed dataframe
imp = MVImputer()
imp.fit_transform(df)

# multiple imputed dict of dataframes, varying the random seed
mvi = {}
for i in range(3):
    imp = MVImputer(random_state=i)
    mvi[i] = imp.fit_transform(df)