OneHotEncoder strips column headers



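I am trying to build an ML model on the Titanic dataset. While preparing the data I used OneHotEncoder to create dummy variables, and in doing so I lost my column headers. This is what the dataset looked like before: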

Pclass  Sex Age SibSp   Parch   Fare    Cabin   Embarked
0   3   1   22.000000   1   0   7.2500  146 2
1   1   0   38.000000   1   0   71.2833 81  0
2   3   0   26.000000   0   0   7.9250  146 2
3   1   0   35.000000   1   0   53.1000 55  2
4   3   1   35.000000   0   0   8.0500  146 2
... ... ... ... ... ... ... ... ...
886 2   1   27.000000   0   0   13.0000 146 2
887 1   0   19.000000   0   0   30.0000 30  2
888 3   0   29.699118   1   2   23.4500 146 2
889 1   1   26.000000   0   0   30.0000 60  0
890 3   1   32.000000   0   0   7.7500  146 1

Here is the code:

ct = ColumnTransformer([('encoder', OneHotEncoder(), [7])], remainder='passthrough')
X = pd.DataFrame(ct.fit_transform(X))
X

And this is what the dataset looks like now:

0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
0   1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 3.0 1.0 22.000000   1.0 7.2500  146.0
1   0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 38.000000   1.0 71.2833 81.0
2   1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 3.0 0.0 26.000000   0.0 7.9250  146.0
3   1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 35.000000   1.0 53.1000 55.0
4   1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 3.0 1.0 35.000000   0.0 8.0500  146.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
886 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 2.0 1.0 27.000000   0.0 13.0000 146.0
887 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 19.000000   0.0 30.0000 30.0
888 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 3.0 0.0 29.699118   1.0 23.4500 146.0
889 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 26.000000   0.0 30.0000 60.0
890 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 3.0 1.0 32.000000   0.0 7.7500  146.0

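You can use the ColumnTransformer's get_feature_names method, provided that all of your transformers support it and that you fitted on a DataFrame. (In scikit-learn 1.0 and later the equivalent method is get_feature_names_out; get_feature_names was removed in 1.2.)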

ct = ColumnTransformer([('encoder', OneHotEncoder(), [7])], remainder='passthrough')
X = pd.DataFrame(ct.fit_transform(X), columns=ct.get_feature_names())
X

The output of fit_transform is array-like. From the documentation:
X_t : {array-like, sparse matrix} of shape (n_samples, sum_n_components)
(that is, not DataFrame-like).

So it has no headers. If you need headers, you have to name the columns yourself when you rebuild the DataFrame.
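One way to name the columns by hand is sketched below, assuming a single one-hot-encoded column and remainder='passthrough'. The example data and the "Embarked_<category>" naming scheme are illustrative choices, not anything prescribed by scikit-learn. It relies on the fact that ColumnTransformer places the listed transformers' outputs first and the passthrough columns last, in their original order.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature frame standing in for the Titanic features
X = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
    "Embarked": [2, 0, 2, 1],
})
cat_col = "Embarked"

ct = ColumnTransformer(
    [("encoder", OneHotEncoder(), [cat_col])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
encoded = ct.fit_transform(X)

# One dummy column per category seen during fit, in sorted category order
categories = ct.named_transformers_["encoder"].categories_[0]
dummy_names = [f"{cat_col}_{c}" for c in categories]

# Passthrough columns follow the encoded ones, in their original order
rest_names = [c for c in X.columns if c != cat_col]

X_named = pd.DataFrame(encoded, columns=dummy_names + rest_names, index=X.index)
print(X_named)
```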
