How can I one-hot encode a pandas DataFrame with sklearn (returning a DataFrame or numpy array) when some of the columns should not be encoded?
mydf = pd.DataFrame({'Target':[0,1,0,0,1, 1,1],
'GroupFoo':[1,1,2,2,3,1,2],
'GroupBar':[2,1,1,0,3,1,2],
'GroupBar2':[2,1,1,0,3,1,2],
'SomeOtherShouldBeUnaffected':[2,1,1,0,3,1,2]})
columnsToEncode = ['GroupFoo', 'GroupBar']
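This is a DataFrame that has already been label-encoded, and I would like to one-hot encode only the columns listed in columnsToEncode.
My problem is that I am not sure whether a pd.DataFrame or a numpy array representation is better, and how to merge the encoded part back with the remaining columns.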
My attempts so far:
myEncoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
myEncoder.fit(X_train[columnsToEncode])
df = pd.concat([
    X_train.drop(columns=columnsToEncode),  # select all other / numeric columns
    # select the categorical columns and one-hot encode them
    pd.DataFrame(myEncoder.transform(X_train[columnsToEncode]), index=X_train.index)
], axis=1)
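Note: I am aware of pandas get_dummies (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html), but it does not work across a train/test split, and that is the kind of encoding I need.
This library provides several categorical encoders that make sklearn/numpy play nicely with pandas: https://github.com/wdm0006/categorical_encoding
However, they do not yet support "handle unknown category".
For now I will use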
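# On scikit-learn >= 1.2 the argument is named sparse_output instead of sparse.
myEncoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
myEncoder.fit(df[columnsToEncode])
# Passing index=df.index keeps the rows aligned when df has a non-default index.
pd.concat([df.drop(columns=columnsToEncode),
           pd.DataFrame(myEncoder.transform(df[columnsToEncode]), index=df.index)], axis=1)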
because this supports unseen data. For now I will stick with half pandas, half numpy, because of the nice pandas labels.
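As a minimal sketch of the concat approach above, the following keeps readable column labels and shows what happens to a category that was not seen during fitting. The train/test frames, the `encode` helper, and the `<column>_<category>` naming scheme are assumptions made for illustration; they are not part of the original question.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'Target': [0, 1, 0, 0],
                      'GroupFoo': [1, 1, 2, 2],
                      'GroupBar': [2, 1, 1, 0],
                      'SomeOtherShouldBeUnaffected': [2, 1, 1, 0]})
# The test frame contains GroupFoo == 3, which never occurs in train.
test = pd.DataFrame({'Target': [1, 1],
                     'GroupFoo': [3, 1],
                     'GroupBar': [1, 2],
                     'SomeOtherShouldBeUnaffected': [3, 1]},
                    index=[10, 11])
columnsToEncode = ['GroupFoo', 'GroupBar']

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train[columnsToEncode])

# Build readable labels from the categories learned during fit.
encoded_names = [f'{col}_{cat}'
                 for col, cats in zip(columnsToEncode, encoder.categories_)
                 for cat in cats]


def encode(frame):
    """Hypothetical helper: one-hot encode columnsToEncode, keep the rest."""
    # transform() returns a sparse matrix by default; toarray() makes it dense.
    dense = encoder.transform(frame[columnsToEncode]).toarray()
    # Reusing frame.index keeps rows aligned in the concat below.
    encoded = pd.DataFrame(dense, columns=encoded_names, index=frame.index)
    return pd.concat([frame.drop(columns=columnsToEncode), encoded], axis=1)


train_encoded = encode(train)
test_encoded = encode(test)
print(test_encoded)
```

Both frames end up with the same columns, and the row with the unseen GroupFoo value gets zeros in every GroupFoo column, which is the behaviour `handle_unknown='ignore'` is meant to provide.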
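For one-hot encoding I recommend using ColumnTransformer and OneHotEncoder instead of get_dummies. The reason is that OneHotEncoder returns an object that can encode unseen samples with the same mapping you used on your training data.
The following code encodes all columns listed in the columns_to_encode variable: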
import pandas as pd
import numpy as np
df = pd.DataFrame({'cat_1': ['A1', 'B1', 'C1'], 'num_1': [100, 200, 300],
'cat_2': ['A2', 'B2', 'C2'], 'cat_3': ['A3', 'B3', 'C3'],
'label': [1, 0, 0]})
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
columns_to_encode = [0, 2, 3] # Change here
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), columns_to_encode)], remainder='passthrough')
X = np.array(ct.fit_transform(X))
X: array([[1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 100],
[0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 200],
[0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 300]], dtype=object)
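To avoid multicollinearity caused by the dummy variable trap, I also recommend dropping one of the columns produced for each column you encode. The following code encodes all columns listed in the columns_to_encode variable and drops the last column of each one-hot encoded group: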
import pandas as pd
import numpy as np
def sum_prev(l_in):
    # Cumulative sum of the group sizes, minus 1: the index of each group's last column.
    l_out = []
    l_out.append(l_in[0])
    for i in range(len(l_in)-1):
        l_out.append(l_out[i] + l_in[i+1])
    return [e - 1 for e in l_out]
df = pd.DataFrame({'cat_1': ['A1', 'B1', 'C1'], 'num_1': [100, 200, 300],
'cat_2': ['A2', 'B2', 'C2'], 'cat_3': ['A3', 'B3', 'C3'],
'label': [1, 0, 0]})
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
columns_to_encode = [0, 2, 3] # Change here
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), columns_to_encode)], remainder='passthrough')
columns_to_encode = [df.iloc[:, del_idx].nunique() for del_idx in columns_to_encode]
columns_to_encode = sum_prev(columns_to_encode)
X = np.array(ct.fit_transform(X))
X = np.delete(X, columns_to_encode, 1)
X: array([[1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 100],
[0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 200],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 300]], dtype=object)
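As an alternative to deleting columns by hand, OneHotEncoder has a `drop` parameter (available since scikit-learn 0.21, as far as I know). The sketch below uses `drop='first'`, so it removes the first category of each group rather than the last one as in the code above; the effect on multicollinearity is the same. `sparse_threshold=0` is an addition to force a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'cat_1': ['A1', 'B1', 'C1'], 'num_1': [100, 200, 300],
                   'cat_2': ['A2', 'B2', 'C2'], 'cat_3': ['A3', 'B3', 'C3'],
                   'label': [1, 0, 0]})
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

columns_to_encode = [0, 2, 3]
ct = ColumnTransformer(
    # drop='first' leaves k - 1 indicator columns for a feature with k categories.
    transformers=[('encoder', OneHotEncoder(drop='first'), columns_to_encode)],
    remainder='passthrough',
    sparse_threshold=0)  # always return a dense array
X_encoded = ct.fit_transform(X)
print(X_encoded)
```

Here the rows that held the first category of every group (A1, A2, A3) become all zeros in the encoded part, while num_1 is passed through unchanged as the last column.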
I believe this update to the initial answer is even better for performing dummy coding:
import logging
import pandas as pd
from sklearn.base import TransformerMixin

log = logging.getLogger(__name__)

class CategoricalDummyCoder(TransformerMixin):
    """Identifies categorical columns by dtype of object and dummy codes them. Optionally a pandas.DataFrame
    can be returned where categories are of pandas.Category dtype and not binarized for better coding strategies
    than dummy coding."""

    def __init__(self, only_categoricals=False):
        self.categorical_variables = []
        self.categories_per_column = {}
        self.only_categoricals = only_categoricals

    def fit(self, X, y=None):
        # Remember which columns are categorical and which categories each one has.
        self.categorical_variables = list(X.select_dtypes(include=['object']).columns)
        log.debug(f'identified the following categorical variables: {self.categorical_variables}')
        for col in self.categorical_variables:
            self.categories_per_column[col] = X[col].astype('category').cat.categories
        log.debug('fitted categories')
        return self

    def transform(self, X):
        X = X.copy()  # do not modify the caller's DataFrame
        for col in self.categorical_variables:
            log.debug(f'transforming cat col: {col}')
            # Categories unseen during fit become NaN (code -1, or all-zero dummies).
            X[col] = pd.Categorical(X[col], categories=self.categories_per_column[col])
            if self.only_categoricals:
                X[col] = X[col].cat.codes
        if not self.only_categoricals:
            return pd.get_dummies(X, sparse=True)
        else:
            return X
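The core idea behind this transformer can be sketched without the class: if the categories are learned on the training data and applied to new data through pd.Categorical, get_dummies produces the same columns every time. The example frames and the `color`/`size` column names below are made up for illustration.

```python
import pandas as pd

train = pd.DataFrame({'color': ['red', 'green', 'blue'], 'size': [1, 2, 3]})
# 'purple' does not occur in train, and 'red' and 'blue' are missing from test.
test = pd.DataFrame({'color': ['green', 'purple'], 'size': [4, 5]})

# Plain get_dummies derives its columns from the values present in this frame only.
naive = pd.get_dummies(test)

# "fit": remember the categories seen in the training data.
categories = train['color'].astype('category').cat.categories

# "transform": pin the column to those categories before dummy coding.
test_fixed = test.copy()
test_fixed['color'] = pd.Categorical(test_fixed['color'], categories=categories)
fixed = pd.get_dummies(test_fixed)

print(list(naive.columns))
print(list(fixed.columns))
```

With the pinned categories, the test frame gets one dummy column per training category, and the unseen 'purple' row is simply all zeros instead of creating a new column.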