How to keep the columns from the training data for prediction in Python

I have a dataset that looks like this:

| Amount   | Source | y |
| -------- | ------ | - |
| 285      | a      | 1 |
| 556      | b      | 0 | 
| 883      | c      | 0 |
| 156      | c      | 1 |
| 374      | a      | 1 |
| 1520     | d      | 0 |

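'Source' is a categorical variable. Its categories are "a", "b", "c" and "d", so the one-hot encoded columns are 'source_a', 'source_b', 'source_c' and 'source_d'. I use this model to predict the value of y. The new data used for prediction does not contain all the categories seen in training; it only has "a", "c" and "d". When I one-hot encode this dataset, the 'source_b' column is missing. How can I transform this data so that it matches the training data?

PS: I am using XGBClassifier() for prediction.

Use the same encoder instance. Assuming you chose sklearn's OneHotEncoder, all you have to do is export it as a pickle so that you can reuse it for inference when needed.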

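from sklearn.preprocessing import OneHotEncoder
import pickle

# handle_unknown='ignore' encodes categories unseen during fit as all zeros
enc = OneHotEncoder(handle_unknown='ignore')
# assume X_train is the Source column as 2D input, e.g. df[['Source']]
X_train = enc.fit_transform(X_train)
# save the fitted encoder so inference reuses the same categories
with open('onehot.pickle', 'wb') as f:
    pickle.dump(enc, f)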

Then load it for inference:

import pickle
with open('onehot.pickle', 'rb') as f:
    loaded_enc = pickle.load(f)

Then all you have to do is:

#X_test is the source column of your test data
X_test = loaded_enc.transform(X_test)

In general, once you have fit the encoder on X_train, all you have to do is transform the test set. So:

X_test = loaded_enc.transform(X_test)

Writing it out explicitly:

import pandas as pd
import numpy as np
# an example of your dataframe with no "b" source
df = pd.DataFrame({
    "Amount": [int(i) for i in np.random.normal(800, 300, 10)],
    "Source": np.random.choice(["a", "c", "d"], 10),
    "y": np.random.choice([1, 0], 10)
})
# One Hot Encoding
df["source_a"] = np.where(df.Source == "a",1,0)
df["source_b"] = np.where(df.Source == "b",1,0)
df["source_c"] = np.where(df.Source == "c",1,0)
df["source_d"] = np.where(df.Source == "d",1,0)

Output of the dataframe:

   Amount Source  y  source_a  source_b  source_c  source_d
0     685      d  0         0         0         0         1
1    1149      c  1         0         0         1         0
2    1220      a  0         1         0         0         0
3     834      c  0         0         0         1         0
4     780      c  0         0         0         1         0
5     502      a  0         1         0         0         0
6     191      c  1         0         0         1         0
7     637      c  0         0         0         1         0
8     701      d  0         0         0         0         1
9     941      c  1         0         0         1         0
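
If you would rather not write one line per category, a possible alternative (not what the answer above does) is `pd.get_dummies` followed by `reindex` against the training columns. This is only a sketch: the `new` frame and its Amount values are made up for illustration, and it assumes you still have the training column list available at prediction time.

```python
import pandas as pd

# Training features from the question's table.
train = pd.DataFrame({
    "Amount": [285, 556, 883, 156, 374, 1520],
    "Source": ["a", "b", "c", "c", "a", "d"],
})
# Hypothetical new data without category "b".
new = pd.DataFrame({
    "Amount": [410, 975, 1200],
    "Source": ["a", "c", "d"],
})

# One-hot encode both frames with the same prefix.
train_X = pd.get_dummies(train, columns=["Source"], prefix="source", dtype=int)
new_X = pd.get_dummies(new, columns=["Source"], prefix="source", dtype=int)

# Align the new data to the training columns: missing ones are added as 0,
# and any column not seen in training is dropped.
new_X = new_X.reindex(columns=train_X.columns, fill_value=0)

print(new_X)
```

Because `reindex` also fixes the column order, the result can be passed to a model trained on `train_X` without further rearranging.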

As a general rule, dependencies must be minimized…
