How to apply preprocessing methods to multiple columns at once in sklearn



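My problem is that my pandas DataFrame has a lot of columns, and I am trying to apply sklearn preprocessing to them using DataFrameMapper from the sklearn-pandas library, for example: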

mapper= DataFrameMapper([
    ('gender',sklearn.preprocessing.LabelBinarizer()),
    ('gradelevel',sklearn.preprocessing.LabelEncoder()),
    ('subject',sklearn.preprocessing.LabelEncoder()),
    ('districtid',sklearn.preprocessing.LabelEncoder()),
    ('sbmRate',sklearn.preprocessing.StandardScaler()),
    ('pRate',sklearn.preprocessing.StandardScaler()),
    ('assn1',sklearn.preprocessing.StandardScaler()),
    ('assn2',sklearn.preprocessing.StandardScaler()),
    ('assn3',sklearn.preprocessing.StandardScaler()),
    ('assn4',sklearn.preprocessing.StandardScaler()),
    ('assn5',sklearn.preprocessing.StandardScaler()),
    ('attd1',sklearn.preprocessing.StandardScaler()),
    ('attd2',sklearn.preprocessing.StandardScaler()),
    ('attd3',sklearn.preprocessing.StandardScaler()),
    ('attd4',sklearn.preprocessing.StandardScaler()),
    ('attd5',sklearn.preprocessing.StandardScaler()),
    ('sbm1',sklearn.preprocessing.StandardScaler()),
    ('sbm2',sklearn.preprocessing.StandardScaler()),
    ('sbm3',sklearn.preprocessing.StandardScaler()),
    ('sbm4',sklearn.preprocessing.StandardScaler()),
    ('sbm5',sklearn.preprocessing.StandardScaler())
 ])

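I am wondering whether there is a more concise way to preprocess many variables at once, without writing each one out explicitly.

Another thing I find a bit annoying is that when I convert a pandas DataFrame into an array that sklearn can use, the column names are lost, which makes feature selection very difficult. Does anyone know how to keep the column names as keys when converting a pandas DataFrame to a NumPy array?

Thanks a lot!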

from sklearn.preprocessing import LabelBinarizer, LabelEncoder, StandardScaler
from sklearn_pandas import DataFrameMapper

# Group the column names by the transformer they need.
encoders = ['gradelevel', 'subject', 'districtid']
scalars = ['sbmRate', 'pRate', 'assn1', 'assn2', 'assn3', 'assn4', 'assn5', 'attd1', 'attd2', 'attd3', 'attd4', 'attd5', 'sbm1', 'sbm2', 'sbm3', 'sbm4', 'sbm5']

# Each comprehension creates a fresh transformer instance per column.
mapper = DataFrameMapper(
    [('gender', LabelBinarizer())] +
    [(encoder, LabelEncoder()) for encoder in encoders] +
    # StandardScaler expects 2-D input, so select each column as a one-element list.
    [([scalar], StandardScaler()) for scalar in scalars]
)

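If you do this often, you could even wrap it in your own function (here a hypothetical data_frame_mapper) and call it like this: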

mapper = data_frame_mapper(binarizers=['gender'],
    encoders=['gradelevel', 'subject', 'districtid'],
    scalars=['sbmRate', 'pRate', 'assn1', 'assn2', 'assn3', 'assn4', 'assn5', 'attd1', 'attd2', 'attd3', 'attd4', 'attd5', 'sbm1', 'sbm2', 'sbm3', 'sbm4', 'sbm5'])
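
The call above assumes a data_frame_mapper function that the answer does not define. One possible sketch follows; the names build_features and data_frame_mapper, their keyword arguments, and the **kwargs pass-through are all assumptions made for illustration and are not part of sklearn-pandas. The scaled columns are selected as one-element lists because StandardScaler expects 2-D input.

```python
from sklearn.preprocessing import LabelBinarizer, LabelEncoder, StandardScaler


def build_features(binarizers=(), encoders=(), scalars=()):
    """Return (column selector, transformer) pairs for DataFrameMapper.

    Hypothetical helper: every column gets its own transformer instance.
    """
    return (
        # A string selector passes a 1-D column, which the label transformers expect.
        [(col, LabelBinarizer()) for col in binarizers]
        + [(col, LabelEncoder()) for col in encoders]
        # A list selector passes a 2-D column, which StandardScaler expects.
        + [([col], StandardScaler()) for col in scalars]
    )


def data_frame_mapper(binarizers=(), encoders=(), scalars=(), **kwargs):
    """Hypothetical wrapper matching the call shown above."""
    # Imported here so build_features stays usable without sklearn-pandas.
    from sklearn_pandas import DataFrameMapper

    # Extra keyword arguments are forwarded to DataFrameMapper unchanged.
    return DataFrameMapper(build_features(binarizers, encoders, scalars), **kwargs)
```

With this sketch, mapper.fit_transform(df) would return a NumPy array. On the column-name question: recent sklearn-pandas versions accept df_out=True on DataFrameMapper, so data_frame_mapper(..., df_out=True) should return a DataFrame that keeps column names instead; check that your installed version supports it.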
