Imputation of categorical missing values in scikit-learn



I have pandas data with some columns of text type. There are some NaN values along with these text columns. What I'm trying to do is to impute those NaNs with sklearn.preprocessing.Imputer (replacing NaN by the most frequent value). The problem is in the implementation. Suppose there is a Pandas dataframe df with 30 columns, 10 of which are of categorical nature. Once I run:

from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
imp.fit(df) 

Python generates an error: 'could not convert string to float: 'run1'', where 'run1' is an ordinary (non-missing) value from the first column with categorical data.

Any help would be very welcome!

To use mean values for numeric columns and the most frequent value for non-numeric columns, you could do something like this. You could further distinguish between integers and floats. I guess it might make sense to use the median for integer columns instead.

import pandas as pd
import numpy as np
from sklearn.base import TransformerMixin
class DataFrameImputer(TransformerMixin):
    def __init__(self):
        """Impute missing values.
        Columns of dtype object are imputed with the most frequent value 
        in column.
        Columns of other types are imputed with mean of column.
        """
    def fit(self, X, y=None):
        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
            index=X.columns)
        return self
    def transform(self, X, y=None):
        return X.fillna(self.fill)
data = [
    ['a', 1, 2],
    ['b', 1, 1],
    ['b', 2, 2],
    [np.nan, np.nan, np.nan]
]
X = pd.DataFrame(data)
xt = DataFrameImputer().fit_transform(X)
print('before...')
print(X)
print('after...')
print(xt)

which prints:

before...
     0   1   2
0    a   1   2
1    b   1   1
2    b   2   2
3  NaN NaN NaN
after...
   0         1         2
0  a  1.000000  2.000000
1  b  1.000000  1.000000
2  b  2.000000  2.000000
3  b  1.333333  1.666667
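Because fit and transform are separate steps, the fills learned on one frame can be reapplied to new data. A small sketch (restating the class above so the snippet runs on its own; the train/test frames are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):
    """Mode for object columns, mean for the rest (same logic as above)."""
    def fit(self, X, y=None):
        self.fill = pd.Series(
            [X[c].value_counts().index[0] if X[c].dtype == np.dtype('O')
             else X[c].mean() for c in X],
            index=X.columns)
        return self
    def transform(self, X, y=None):
        return X.fillna(self.fill)

train = pd.DataFrame({'cat': ['a', 'b', 'b', np.nan],
                      'num': [1.0, 2.0, np.nan, 3.0]})
test = pd.DataFrame({'cat': [np.nan, 'a'],
                     'num': [np.nan, 5.0]})

imp = DataFrameImputer().fit(train)   # learns 'b' and 2.0 from train only
test_t = imp.transform(test)          # reuses those fills on the new frame
print(test_t['cat'].tolist())  # ['b', 'a']
print(test_t['num'].tolist())  # [2.0, 5.0]
```

This matters in a modeling workflow: the test set should be filled with statistics from the training set, not its own.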

You can use sklearn_pandas.CategoricalImputer for the categorical columns. Details:

First, (from the book Hands-On Machine Learning with Scikit-Learn and TensorFlow) you can have subpipelines for numerical and string/categorical features, where each subpipeline's first transformer is a selector that takes a list of column names (and the full_pipeline.fit_transform() takes a pandas DataFrame):

from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

You can then combine these subpipelines with sklearn.pipeline.FeatureUnion, for example:

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline)
])

Then, in the num_pipeline you can simply use sklearn.preprocessing.Imputer(), but in the cat_pipeline you can use CategoricalImputer() from the sklearn_pandas package.

note: the sklearn-pandas package can be installed with pip install sklearn-pandas, but it is imported as import sklearn_pandas
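Putting those pieces together, a minimal runnable sketch. The column names age and city are invented for illustration, and SimpleImputer stands in for both the deprecated Imputer and CategoricalImputer, since in modern scikit-learn strategy='most_frequent' handles string columns directly:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select named columns from a DataFrame and return a NumPy array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

num_pipeline = Pipeline([
    ('select', DataFrameSelector(['age'])),              # 'age' is invented
    ('impute', SimpleImputer(strategy='mean')),
])
cat_pipeline = Pipeline([
    ('select', DataFrameSelector(['city'])),             # 'city' is invented
    ('impute', SimpleImputer(strategy='most_frequent')),
])
full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline),
])

df = pd.DataFrame({'age': [20.0, np.nan, 40.0],
                   'city': ['ny', 'ny', np.nan]})
out = full_pipeline.fit_transform(df)  # numeric column first, then categorical
```

FeatureUnion horizontally stacks the two subpipeline outputs, so the result is a single array with the mean-imputed numeric column followed by the mode-imputed categorical one.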

There is a package sklearn-pandas which has an imputation option for categorical variables: https://github.com/scikit-learn-contrib/sklearn-pandas#categoricalimputer

>>> from sklearn_pandas import CategoricalImputer
>>> data = np.array(['a', 'b', 'b', np.nan], dtype=object)
>>> imputer = CategoricalImputer()
>>> imputer.fit_transform(data)
array(['a', 'b', 'b', 'b'], dtype=object)
  • strategy = 'most_frequent' can only be used with quantitative features, not with qualitative ones. This custom imputer can be used for both qualitative and quantitative features. Also, with the scikit-learn imputer, either we can use it for the whole data frame (if all features are quantitative), or we can use a 'for loop' over a list of similar-type features/columns (see the example below). But the custom imputer can be used with any combination.

        from sklearn.preprocessing import Imputer
        impute = Imputer(strategy='mean')
        for cols in ['quantitative_column', 'quant']:  # here both are quantitative features.
              xx[cols] = impute.fit_transform(xx[[cols]])
    
  • Custom Imputer:

       import numpy as np
       from sklearn.preprocessing import Imputer
       from sklearn.base import TransformerMixin

       class CustomImputer(TransformerMixin):
             def __init__(self, cols=None, strategy='mean'):
                   self.cols = cols
                   self.strategy = strategy
             def transform(self, df):
                   X = df.copy()
                   impute = Imputer(strategy=self.strategy)
                   if self.cols is None:
                          self.cols = list(X.columns)
                   for col in self.cols:
                          if X[col].dtype == np.dtype('O'):
                                 # qualitative column: fill with the most frequent value
                                 X[col].fillna(X[col].value_counts().index[0], inplace=True)
                          else:
                                 # quantitative column: delegate to Imputer (mean/median)
                                 X[col] = impute.fit_transform(X[[col]])
                   return X
             def fit(self, *_):
                   return self
    
  • Dataframe:

          X = pd.DataFrame({'city': ['tokyo', np.NaN, 'london', 'seattle',
                                     'san francisco', 'tokyo'],
                            'boolean': ['yes', 'no', np.NaN, 'no', 'no', 'yes'],
                            'ordinal_column': ['somewhat like', 'like', 'somewhat like',
                                               'like', 'somewhat like', 'dislike'],
                            'quantitative_column': [1, 11, -.5, 10, np.NaN, 20]})
    
                city              boolean   ordinal_column  quantitative_column
            0   tokyo             yes       somewhat like   1.0
            1   NaN               no        like            11.0
            2   london            NaN       somewhat like   -0.5
            3   seattle           no        like            10.0
            4   san francisco     no        somewhat like   NaN
            5   tokyo             yes       dislike         20.0
    
  • 1) Can be used with a list of similar-type features.

     cci = CustomImputer(cols=['city', 'boolean']) # here default strategy = mean
     cci.fit_transform(X)
    
  • 2) Can be used with strategy = median.

     sd = CustomImputer(['quantitative_column'], strategy = 'median')
     sd.fit_transform(X)
    
  • 3) Can be used with the whole data frame; it will use the default mean (or we can also change it to median). For qualitative features it uses strategy = 'most_frequent', and for quantitative ones mean/median.

     call = CustomImputer()
     call.fit_transform(X)   
    

I copied and modified sveitser's answer and made an imputer for a pandas.Series object.

import numpy
import pandas 
from sklearn.base import TransformerMixin
class SeriesImputer(TransformerMixin):
    def __init__(self):
        """Impute missing values.
        If the Series is of dtype Object, then impute with the most frequent object.
        If the Series is not of dtype Object, then impute with the mean.  
        """
    def fit(self, X, y=None):
        if   X.dtype == numpy.dtype('O'): self.fill = X.value_counts().index[0]
        else                            : self.fill = X.mean()
        return self
    def transform(self, X, y=None):
        return X.fillna(self.fill)

To use it you would do:

# Make a series
s1 = pandas.Series(['k', 'i', 't', 't', 'e', numpy.NaN])

a  = SeriesImputer()   # Initialize the imputer
a.fit(s1)              # Fit the imputer
s2 = a.transform(s1)   # Get a new series
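Because SeriesImputer works on one Series at a time, it can also be applied column by column across a whole DataFrame via apply. A small sketch (the class is restated so the snippet is self-contained; the example frame is invented):

```python
import numpy
import pandas
from sklearn.base import TransformerMixin

class SeriesImputer(TransformerMixin):
    def fit(self, X, y=None):
        # most frequent value for object Series, mean otherwise
        if X.dtype == numpy.dtype('O'):
            self.fill = X.value_counts().index[0]
        else:
            self.fill = X.mean()
        return self
    def transform(self, X, y=None):
        return X.fillna(self.fill)

df = pandas.DataFrame({'letters': ['k', 'i', 't', 't', numpy.nan],
                       'nums': [1.0, 2.0, numpy.nan, 4.0, 5.0]})
# fit and apply a fresh imputer to each column independently
filled = df.apply(lambda col: SeriesImputer().fit_transform(col))
print(filled['letters'].tolist())  # ['k', 'i', 't', 't', 't']
print(filled['nums'].tolist())     # [1.0, 2.0, 3.0, 4.0, 5.0]
```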

Inspired by the answers here, and wanting a go-to Imputer for all use-cases, I ended up writing this. It supports four strategies for imputation: mean, mode, median, fill, and works on both pd.DataFrame and pd.Series.

mean and median only work for numeric data; mode and fill work for both numeric and categorical data.

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class CustomImputer(BaseEstimator, TransformerMixin):
    def __init__(self, strategy='mean', filler='NA'):
        self.strategy = strategy
        self.fill = filler

    def fit(self, X, y=None):
        if self.strategy in ['mean', 'median']:
            if not all(X.dtypes == np.number):
                raise ValueError('dtypes mismatch: np.number dtype is '
                                 'required for ' + self.strategy)
        if self.strategy == 'mean':
            self.fill = X.mean()
        elif self.strategy == 'median':
            self.fill = X.median()
        elif self.strategy == 'mode':
            self.fill = X.mode().iloc[0]
        elif self.strategy == 'fill':
            if type(self.fill) is list and type(X) is pd.DataFrame:
                self.fill = dict([(cname, v) for cname, v in zip(X.columns, self.fill)])
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

Usage:

>> df   
    MasVnrArea  FireplaceQu
Id  
1   196.0   NaN
974 196.0   NaN
21  380.0   Gd
5   350.0   TA
651 NaN     Gd

>> CustomImputer(strategy='mode').fit_transform(df)
MasVnrArea  FireplaceQu
Id      
1   196.0   Gd
974 196.0   Gd
21  380.0   Gd
5   350.0   TA
651 196.0   Gd
>> CustomImputer(strategy='fill', filler=[0, 'NA']).fit_transform(df)
MasVnrArea  FireplaceQu
Id      
1   196.0   NA
974 196.0   NA
21  380.0   Gd
5   350.0   TA
651 0.0     Gd 
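For reference, the mode and fill strategies above have pure-pandas equivalents. A small sketch reproducing the example frame (not part of the original answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'MasVnrArea': [196.0, 196.0, 380.0, 350.0, np.nan],
                   'FireplaceQu': [np.nan, np.nan, 'Gd', 'TA', 'Gd']})

# mode strategy: df.mode() gives the most frequent value(s) per column
modes = df.mode().iloc[0]
by_mode = df.fillna(modes)

# fill strategy: a per-column constant, expressed directly as a dict
by_fill = df.fillna({'MasVnrArea': 0, 'FireplaceQu': 'NA'})

print(by_mode['FireplaceQu'].tolist())  # ['Gd', 'Gd', 'Gd', 'TA', 'Gd']
print(by_fill['MasVnrArea'].tolist())   # [196.0, 196.0, 380.0, 350.0, 0.0]
```

The class is still useful for plugging into pipelines, but for one-off cleaning these one-liners do the same job.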

This code fills in a series with the most frequent category:

import pandas as pd
import numpy as np
# create fake data 
m = pd.Series(list('abca'))
m.iloc[1] = np.nan #artificially introduce nan
print('m = ')
print(m)
#make dummy variables, count and sort descending:
most_common = pd.get_dummies(m).sum().sort_values(ascending=False).index[0] 
def replace_most_common(x):
    if pd.isnull(x):
        return most_common
    else:
        return x
new_m = m.map(replace_most_common) #apply function to original data
print('new_m = ')
print(new_m)

Output:

m =
0      a
1    NaN
2      c
3      a
dtype: object
new_m =
0    a
1    a
2    c
3    a
dtype: object

Using sklearn.impute.SimpleImputer instead of Imputer easily solves this problem, since SimpleImputer can handle categorical variables.

Per the Sklearn documentation: If "most_frequent", then replace missing using the most frequent value along each column. Can be used with strings or numeric data.

https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

impute_size = SimpleImputer(strategy="most_frequent")
data['Outlet_Size'] = impute_size.fit_transform(data[['Outlet_Size']])
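A self-contained version of that snippet; the Outlet_Size values are invented for illustration, and .ravel() flattens the 2-D array that fit_transform returns before assigning it back to the column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

data = pd.DataFrame({'Outlet_Size': ['Medium', 'Small', np.nan, 'Medium']})
impute_size = SimpleImputer(strategy='most_frequent')
# fit_transform returns shape (n, 1); ravel() makes it a 1-D column again
data['Outlet_Size'] = impute_size.fit_transform(data[['Outlet_Size']]).ravel()
print(data['Outlet_Size'].tolist())  # ['Medium', 'Small', 'Medium', 'Medium']
```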

MissForest can be used to impute missing values in a categorical variable along with the other categorical features. It works in an iterative fashion, similar to IterativeImputer, with a random forest as the base model.

Below is the code to label encode the features together with the target variable, fit the model to impute the NaN values, and encode the features back:

import sys
import pandas as pd
import sklearn.neighbors._base
from sklearn.preprocessing import LabelEncoder
# missingpy imports the old sklearn.neighbors.base module path, so alias it first
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base
from missingpy import MissForest
def label_encoding(df, columns):
    """
    Label encodes the set of the features to be used for imputation
    Args:
        df: data frame (processed data)
        columns: list (features to be encoded)
    Returns: dictionary
    """
    encoders = dict()
    for col_name in columns:
        series = df[col_name]
        label_encoder = LabelEncoder()
        df[col_name] = pd.Series(
            label_encoder.fit_transform(series[series.notnull()]),
            index=series[series.notnull()].index
        )
        encoders[col_name] = label_encoder
    return encoders
# adding to be imputed global category along with features
features = ['feature_1', 'feature_2', 'target_variable']
# label encoding features
encoders = label_encoding(data, features)
# categorical imputation using random forest 
# parameters can be tuned accordingly
imp_cat = MissForest(n_estimators=50, max_depth=80)
data[features] = imp_cat.fit_transform(data[features], cat_vars=[0, 1, 2])
# decoding features
for variable in features:
    data[variable] = encoders[variable].inverse_transform(data[variable].astype(int))

Similar. Modify Imputer to handle strategy='most_frequent':

import pandas as pd
from sklearn.preprocessing import Imputer

class GeneralImputer(Imputer):
    def __init__(self, **kwargs):
        Imputer.__init__(self, **kwargs)
    def fit(self, X, y=None):
        if self.strategy == 'most_frequent':
            self.fills = pd.DataFrame(X).mode(axis=0).squeeze()
            self.statistics_ = self.fills.values
            return self
        else:
            return Imputer.fit(self, X, y=y)
    def transform(self, X):
        if hasattr(self, 'fills'):
            return pd.DataFrame(X).fillna(self.fills).values.astype(str)
        else:
            return Imputer.transform(self, X)

where pandas.DataFrame.mode() finds the most frequent value per column, and then pandas.DataFrame.fillna() fills the missing values with them. Other strategy values are still handled the same way by Imputer.

You could try the following:

replace = df['<yourcolumn>'].value_counts().idxmax()
df['<yourcolumn>'].fillna(replace, inplace=True)

This is my attempt at multiple imputation, based on @Gautham Kumaran's idea. It uses 'most_frequent' replacement for categorical variables and then multiple imputation by regression for the numerical variables.

# missing values imputation
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.base import BaseEstimator, TransformerMixin

# class for missing data imputation
# =============================================================
class MVImputer(BaseEstimator, TransformerMixin):
    def __init__(self, random_state=None, filler='NA'):
        self.random_state = random_state
        self.fill = filler

    def fit(self, X, y=None):
        categorical_dtypes = ['object', 'category', 'bool']
        # most-frequent fill value for each categorical column
        self.fill = {col: X[col].mode().iloc[0]
                     for col in X.columns
                     if X[col].dtype.name in categorical_dtypes}
        # iterative (regression-based) imputer for the numerical columns
        self.num_cols = [col for col in X.columns
                         if X[col].dtype.name not in categorical_dtypes]
        if self.num_cols:
            self.imputer = IterativeImputer(
                max_iter=10,
                random_state=self.random_state,
                min_value=X[self.num_cols].min(axis=0),
                max_value=X[self.num_cols].max(axis=0))
            self.imputer.fit(X[self.num_cols])
        return self

    def transform(self, X, y=None):
        X = X.fillna(self.fill)
        if self.num_cols:
            X[self.num_cols] = self.imputer.transform(X[self.num_cols])
        return X

# call for single imputed dataframe
imp = MVImputer()
imp.fit_transform(df)

# multiple imputed dict of dataframes (varying the random state per draw)
mvi = {}
for i in range(3):
    imp = MVImputer(random_state=i)
    mvi[i] = imp.fit_transform(df)
