LabelEncoder: encoding missing values



I am using LabelEncoder to convert categorical data into numeric values.

How does LabelEncoder handle missing values?

from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
a = pd.DataFrame(['A','B','C',np.nan,'D','A'])
le = LabelEncoder()
le.fit_transform(a)

Output:

array([1, 2, 3, 0, 4, 1])

In the example above, LabelEncoder turned the NaN value into a category. How can I tell which category represents the missing values?

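Don't use LabelEncoder with missing values. I don't know which version of scikit-learn you are using, but in 0.17.1 your code raises TypeError: unorderable types: str() > float().

As you can see in the source, it calls numpy.unique on the data to be encoded, which raises a TypeError if missing values are found. If you want to encode the missing values, first change their type to string: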

a[pd.isnull(a)]  = 'NaN'
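
To answer the original question of which integer stands for the missing values, one option is to look the placeholder up after fitting. This is only a minimal sketch: the name MISSING and its value are illustrative choices, not part of scikit-learn, and the placeholder is assumed not to occur in the real data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

MISSING = "NaN"  # hypothetical placeholder; any string not already in the data works

s = pd.Series(['A', 'B', 'C', np.nan, 'D', 'A'])

# Replace missing values first so that every entry is a string
filled = s.astype(object).where(s.notna(), MISSING)

le = LabelEncoder()
codes = le.fit_transform(filled)

# classes_ is sorted, and the position of a label is its integer code
mapping = {label: code for code, label in enumerate(le.classes_)}
missing_code = mapping[MISSING]
print(mapping)
print(missing_code)
```

With this data the placeholder sorts after 'D', so the missing values end up with code 4. The same number comes back from le.transform([MISSING])[0].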

You can also use a mask to put the missing values of the original data frame back after labelling

df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.nan]})
    A   B   C
0   x   1   2.0
1   NaN 6   1.0
2   z   9   NaN
original = df
mask = df.isnull()
       A    B   C
0   False   False   False
1   True    False   False
2   False   False   True
df = df.astype(str).apply(LabelEncoder().fit_transform)
df.where(~mask, original)
     A  B    C
0  1.0  0  1.0
1  NaN  1  0.0
2  2.0  2  NaN
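
The mask idea above can be wrapped in a small helper. This is a sketch under my own assumptions rather than the poster's exact code: the function name encode_with_mask is made up, and missing cells are replaced by the string 'nan' explicitly before encoding so the result does not depend on how a given pandas version converts NaN to text.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_with_mask(df):
    """Label-encode every column, then put NaN back where the input was missing."""
    mask = df.isna()
    # Turn missing cells into the string 'nan' so every column holds only strings
    as_text = df.astype(object).where(~mask, "nan").astype(str)
    encoded = as_text.apply(
        lambda col: pd.Series(LabelEncoder().fit_transform(col), index=col.index)
    )
    # Cast to float so the columns can hold NaN, then blank out the masked cells
    return encoded.astype(float).where(~mask)

df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.nan]})
result = encode_with_mask(df)
print(result)
```

Note that the placeholder still takes up one integer per column during fitting (0 in column A here), so the remaining codes are not necessarily contiguous from 0.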

Hi, here is a little computational trick I came up with for my own work:

from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
a = pd.DataFrame(['A','B','C',np.nan,'D','A'])
le = LabelEncoder()
### fit with the desired col, col in position 0 for this example
fit_by = pd.Series([i for i in a.iloc[:,0].unique() if type(i) == str])
le.fit(fit_by)
### Set transformed col leaving np.NaN as they are
a["transformed"] = fit_by.apply(lambda x: le.transform([x])[0] if type(x) == str else x)

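Here is my solution, because I was not satisfied with the ones posted here. I needed a LabelEncoder that keeps my missing values as NaN so that I can use an Imputer afterwards. So I wrote my own LabelEncoder class. It works with DataFrames.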

from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin
from sklearn.preprocessing import LabelEncoder
class LabelEncoderByCol(BaseEstimator, TransformerMixin):
    def __init__(self,col):
        #List of column names in the DataFrame that should be encoded
        self.col = col
        #Dictionary storing a LabelEncoder for each column
        self.le_dic = {}
        for el in self.col:
            self.le_dic[el] = LabelEncoder()
    def fit(self,x,y=None):
        #Fill missing values with the string 'NaN'
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            #Only use the values that are not 'NaN' to fit the Encoder
            a = x[el][x[el]!='NaN']
            self.le_dic[el].fit(a)
        return self
    def transform(self,x,y=None):
        #Fill missing values with the string 'NaN'
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            #Only transform the values that are not 'NaN'
            a = x[el][x[el]!='NaN']
            #Store an ndarray of the current column
            b = x[el].to_numpy()
            #Replace the elements in the ndarray that are not 'NaN'
            #using the transformer
            b[b!='NaN'] = self.le_dic[el].transform(a)
            #Overwrite the column in the DataFrame
            x[el]=b
        #return the transformed DataFrame
        return x

You can pass in a DataFrame, not just a 1-dim Series. With col you choose the columns that should be encoded.

I would like to hear some feedback on this.


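I would like to share my solution with you.
I created a module that takes a mixed dataset and converts it from categorical to numeric and back.

The module is also available on my GitHub, well organized and with examples.
If you like my solution, please upvote.

Thanks, Idan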

class label_encoder_contain_missing_values :
        def __init__ (self) :    
            pass  
        def categorical_to_numeric (self,dataset):
            import numpy as np
            import pandas as pd
            
            self.dataset = dataset
            self.summary = None
            self.table_encoder= {}
            for index in self.dataset.columns :
                if self.dataset[index].dtypes == 'object' :               
                   column_data_frame = pd.Series(self.dataset[index],name='column').to_frame()
                   unique_values = pd.Series(self.dataset[index].unique())
                   # Build the lookup table row by row (DataFrame.append was removed in pandas 2.0)
                   rows = []
                   for i, value in enumerate(unique_values):
                       # Missing values keep np.nan as their code
                       rows.append({'value_name': value, 'Encode': np.nan if pd.isnull(value) else i})
                   label_encoder = pd.DataFrame(rows, columns=['value_name', 'Encode'])
                   output = pd.merge(left=column_data_frame,right = label_encoder, how='left',left_on='column',right_on='value_name')
                   self.summary = output[['column','Encode']].drop_duplicates().reset_index(drop=True)
                   self.dataset[index] = output.Encode 
                   self.table_encoder.update({index:self.summary})
                    
                else :
                     pass
                     
            # ---- Show Encode Table ----- #               
            print('''\nLabel encoding completed successfully.\n
                       Next steps:\n
                       1. To view table_encoder, execute the following:\n
                           for index in table_encoder:
                               print(index, table_encoder[index])

                       2. For the inverse, execute the following:\n
                          df = label_encoder_contain_missing_values().
                               inverse_numeric_to_categorical(table_encoder, df) ''')
                        
            return self.table_encoder  ,self.dataset 
        
        def inverse_numeric_to_categorical (self,table_encoder, df):
            dataset = df.copy()
            for column in table_encoder.keys():
                df_column = df[column].to_frame()
                output = pd.merge(left=df_column,right = table_encoder[column], how='left',left_on= column,right_on='Encode')#.rename(columns={'column_x' :'encode','column_y':'category'})
                df[column]= output.column
            print('\nInverse label encoding, from numerical to categorical, completed successfully.\n')
            return df
            
Command to convert from categorical to numerical:
table_encoder, df = label_encoder_contain_missing_values().categorical_to_numeric(df)
Command to convert from numerical to categorical:
df = label_encoder_contain_missing_values().inverse_numeric_to_categorical(table_encoder, df)

A simple way to do it is this.

Here is an example with the Titanic dataset:

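from sklearn.preprocessing import LabelEncoder

LABEL_COL = ["Sex", "Embarked"]
def label(df):
    _df = df.copy()
    le = LabelEncoder()
    for col in LABEL_COL:
        # Boolean index of the rows that are not NaN
        idx = ~_df[col].isna()
        # Encode only those rows; rows with NaN are left untouched
        _df.loc[idx, col] = le.fit_transform(_df.loc[idx, col])
    return _df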
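
One limitation of the function above is that the single encoder is refitted for every column, so only the last column could be decoded afterwards. A possible variation, sketched here with a made-up function name (label_non_null) and a tiny hand-written stand-in for the Titanic data, keeps one fitted encoder per column:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def label_non_null(df, columns):
    """Encode the non-null entries of each column and keep one encoder per column."""
    out = df.copy()
    encoders = {}
    for col in columns:
        idx = out[col].notna()
        le = LabelEncoder()
        codes = le.fit_transform(out.loc[idx, col].astype(str))
        # Float column so that the untouched rows can stay NaN
        encoded = pd.Series(np.nan, index=out.index, dtype=float)
        encoded[idx] = codes
        out[col] = encoded
        encoders[col] = le
    return out, encoders

# Hypothetical miniature of the Titanic columns used above
df = pd.DataFrame({'Sex': ['male', 'female', np.nan, 'male'],
                   'Embarked': ['S', np.nan, 'C', 'S'],
                   'Age': [22, 38, 26, 35]})
encoded, encoders = label_non_null(df, ['Sex', 'Embarked'])
print(encoded)
print(encoders['Sex'].inverse_transform([1, 0]))
```

The returned dictionary makes it possible to call inverse_transform per column later, at the cost of keeping the encoders around.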

The top-voted answer by @Kerem has a typo, so I am posting a corrected and improved answer here:

from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
a = pd.DataFrame(['A','B','C',np.nan,'D','A'])
for j in a.columns.values:
    le = LabelEncoder()
    ### fit with the desired col, col in position 0 for this example
    fit_by = pd.Series([i for i in a[j].unique() if type(i) == str])
    le.fit(fit_by)
    ### Set transformed col leaving np.NaN as they are
    a["transformed"] = a[j].apply(lambda x: le.transform([x])[0] if type(x) == str else x)
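
The same result can be had without calling le.transform once per row. The sketch below is my own variation, assuming the labels are plain strings: it builds a dictionary from the fitted classes_ and lets Series.map do the lookup, which leaves NaN as NaN because NaN is not a key in the dictionary. The name code_of is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

a = pd.DataFrame(['A', 'B', 'C', np.nan, 'D', 'A'])
col = a[0]

le = LabelEncoder()
# Fit only on the values that are actually present
le.fit(col.dropna().astype(str))

# Dictionary from label to integer code, built from the sorted classes_
code_of = {label: code for code, label in enumerate(le.classes_)}

# Values without an entry in the dictionary (here: NaN) come out as NaN
a["transformed"] = col.map(code_of)
print(a)
```

The transformed column is float because it has to hold NaN next to the integer codes.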

You can handle the missing values by replacing them with the string "NaN". The category they are assigned can then be obtained with le.transform().

le.fit_transform(a.fillna('NaN'))
category = le.transform(['NaN'])

Another solution is to let the label encoder ignore the missing values.

a = le.fit_transform(a.astype(str))

You can fill the NaNs with some value and then cast the DataFrame column to string to make things work.

from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
a = pd.DataFrame(['A','B','C',np.nan,'D','A'])
a = a.fillna(99)
le = LabelEncoder()
le.fit_transform(a.astype(str))

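The following encoder handles None values in every category.

from collections import defaultdict
import pandas as pd
from sklearn import preprocessing

class MultiColumnLabelEncoder: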
    def __init__(self):
        self.columns = None
        self.led = defaultdict(preprocessing.LabelEncoder)
    def fit(self, X):
        self.columns = X.columns
        for col in self.columns:
            cat = X[col].unique()
            cat = [x if x is not None else "None" for x in cat]
            self.led[col].fit(cat)
        return self
    def fit_transform(self, X):
        if self.columns is None:
            self.fit(X)
        return self.transform(X)
    def transform(self, X):
        return X.apply(lambda x:  self.led[x.name].transform(x.apply(lambda e: e if e is not None else "None")))
    def inverse_transform(self, X):
        return X.apply(lambda x: self.led[x.name].inverse_transform(x))

Usage example

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', None, 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
                 None]
})

print(df)
   location     owner    pets
0  San_Diego     Champ     cat
1   New_York       Ron     dog
2   New_York     Brick     cat
3  San_Diego      None  monkey
4  San_Diego  Veronica     dog
5       None       Ron     dog
le = MultiColumnLabelEncoder()
le.fit(df)
transformed = le.transform(df)
print(transformed)
   location  owner  pets
0         2      1     0
1         0      3     1
2         0      0     0
3         2      2     2
4         2      4     1
5         1      3     1
inverted = le.inverse_transform(transformed)
print(inverted)
   location     owner    pets
0  San_Diego     Champ     cat
1   New_York       Ron     dog
2   New_York     Brick     cat
3  San_Diego      None  monkey
4  San_Diego  Veronica     dog
5       None       Ron     dog

This function takes a column from a DataFrame and returns the column with only the non-NaN values label-encoded, leaving the rest unchanged

import pandas as pd
from sklearn.preprocessing import LabelEncoder
def label_encode_column(col):
    nans = col.isnull()
    nan_lst = []
    nan_idx_lst = []
    label_lst = []
    label_idx_lst = []
    for idx, nan in enumerate(nans):
        if nan:
            nan_lst.append(col[idx])
            nan_idx_lst.append(idx)
        else:
            label_lst.append(col[idx])
            label_idx_lst.append(idx)
    nan_df = pd.DataFrame(nan_lst, index=nan_idx_lst)
    label_df = pd.DataFrame(label_lst, index=label_idx_lst) 
    label_encoder = LabelEncoder()
    label_df = label_encoder.fit_transform(label_df.astype(str))
    label_df = pd.DataFrame(label_df, index=label_idx_lst)
    final_col = pd.concat([label_df, nan_df])
    
    return final_col.sort_index()
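
A shorter way to get a comparable result is pandas' own factorize, which is a different technique from LabelEncoder and is named here only as an alternative. The sketch relies on factorize marking missing values with -1 by default; the function name factorize_keep_nan is made up.

```python
import numpy as np
import pandas as pd

def factorize_keep_nan(col):
    # sort=True gives sorted categories, as LabelEncoder does for plain strings
    codes, uniques = pd.factorize(col, sort=True)
    # factorize marks missing values with -1; turn those into NaN
    encoded = pd.Series(codes, index=col.index, dtype=float)
    return encoded.where(codes != -1), uniques

col = pd.Series(['b', 'a', np.nan, 'b'])
encoded, uniques = factorize_keep_nan(col)
print(encoded)
print(list(uniques))
```

The second return value plays the role of classes_: uniques[i] is the label that was given code i.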

This is how I did it:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
UNKNOWN_TOKEN = '<unknown>'
a = pd.Series(['A','B','C', 'D','A'], dtype=str).unique().tolist()
a.append(UNKNOWN_TOKEN)
le = LabelEncoder()
le.fit_transform(a)
embedding_map = dict(zip(le.classes_, le.transform(le.classes_)))

When applying it to new test data:

test_df = test_df.apply(lambda x: x if x in embedding_map else UNKNOWN_TOKEN)
le.transform(test_df)
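
Put together as one runnable piece, the unknown-token idea might look like the sketch below. The test values ('Z' and a NaN) are invented for illustration, and test data is assumed to arrive as a pandas Series.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

UNKNOWN_TOKEN = '<unknown>'

train = pd.Series(['A', 'B', 'C', 'D', 'A'])
le = LabelEncoder()
# Fit on the training labels plus the extra token for anything unseen
le.fit(list(train.unique()) + [UNKNOWN_TOKEN])
known = set(le.classes_)

# Hypothetical test data containing an unseen label and a missing value
test = pd.Series(['A', 'Z', np.nan, 'D'])
cleaned = test.map(lambda v: v if v in known else UNKNOWN_TOKEN)
codes = le.transform(cleaned)
print(codes)
```

Both the unseen label and the missing value fall back to the code of UNKNOWN_TOKEN, so this approach does not distinguish between "missing" and "never seen during training".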

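I would also like to contribute my workaround, since I found the other approaches a bit tedious when working with categorical data that contains missing values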

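import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Create a random dataframe (as float, so that column 'A' can hold NaN)
foo = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD')).astype(float)
# Randomly intersperse column 'A' with missing data (NaN)
foo.loc[np.random.randint(0,len(foo), size=20), 'A'] = np.nan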
# Convert this series to string, to simulate our problem
series = foo['A'].astype(str)
# np.nan are converted to the string "nan", mask these out
mask = (series == "nan")
# Apply the LabelEncoder to the unmasked series, replace the masked series with np.nan
series[~mask] = LabelEncoder().fit_transform(series[~mask])
series[mask] = np.nan
foo['A'] = series

Here is my attempt!

import numpy as np
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
#Now lets encode the incomplete Cabin feature
titanic_train_le['Cabin'] = le.fit_transform(titanic_train_le['Cabin'].astype(str))
#get nan code for the cabin categorical feature
cabin_nan_code=le.transform(['nan'])[0]
#Now, retrieve the nan values in the encoded data
titanic_train_le['Cabin'] = titanic_train_le['Cabin'].replace(cabin_nan_code, np.nan)

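I just created my own encoder that can encode a whole DataFrame at once. With this class, None is encoded as 0, which can be handy when building a sparse matrix. Note that the input DataFrame must contain only categorical columns.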

import pandas as pd

class DF_encoder():
    def __init__(self):
        # None is always mapped to 0
        self.mapping = {None : 0}
        self.inverse_mapping = {0 : None}
        self.all_keys =[]
    def fit(self,df:pd.DataFrame):
        # Collect the unique values of all columns into one shared vocabulary
        for col in df.columns:
            keys = list(df[col].unique())
            self.all_keys += keys
        self.all_keys = list(set(self.all_keys))
        for i , item in enumerate(start=1 ,iterable=self.all_keys):
            if item not in self.mapping.keys():
                self.mapping[item] = i
                self.inverse_mapping[i] = item
    def transform(self,df):
        temp_df = pd.DataFrame()
        for col in df.columns:
            temp_df[col] = df[col].map(self.mapping)
        return temp_df

    def inverse_transform(self,df):
        temp_df = pd.DataFrame()
        for col in df.columns:
            temp_df[col] = df[col].map(self.inverse_mapping)
        return temp_df

I ran into the same problem, but none of the above worked for me. So I added a new row containing only "nan" to the training data
