How to create a training dataset with both numeric and one-hot categorical features using scikit-learn



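I have a training dataset that contains both continuous and categorical values. Using scikit-learn, I have built a training set with the one-hot encoded categorical features (x_train_1hot), and I also have a training set with the numeric features (x_train_num).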

import numpy
from sklearn.preprocessing import OneHotEncoder

x_train_num = []
x_test_num = []
x_train_1hot = []
x_test_1hot = []
x_train_full = []
x_test_full = []
cat_feats = []
cat_feats_test = []
for instance in x_train:
    # numeric columns (still strings at this point)
    num_instance = []
    num_instance.append(instance[0])
    num_instance.append(instance[2])
    num_instance.append(instance[4])
    num_instance.append(instance[10])
    num_instance.append(instance[11])
    num_instance.append(instance[12])
    x_train_num.append(num_instance)

    # categorical columns
    cat_instance = []
    cat_instance.append(instance[1])
    cat_instance.append(instance[3])
    cat_instance.append(instance[5])
    cat_instance.append(instance[6])
    cat_instance.append(instance[7])
    cat_instance.append(instance[8])
    cat_instance.append(instance[9])
    cat_instance.append(instance[13])
    cat_feats.append(cat_instance)

for instance in x_test:
    # numeric columns, converted to int
    num_instance = []
    num_instance.append(int(instance[0]))
    num_instance.append(int(instance[2]))
    num_instance.append(int(instance[4]))
    num_instance.append(int(instance[10]))
    num_instance.append(int(instance[11]))
    num_instance.append(int(instance[12]))
    x_test_num.append(num_instance)

    # categorical columns
    cat_instance = []
    cat_instance.append(instance[1])
    cat_instance.append(instance[3])
    cat_instance.append(instance[5])
    cat_instance.append(instance[6])
    cat_instance.append(instance[7])
    cat_instance.append(instance[8])
    cat_instance.append(instance[9])
    cat_instance.append(instance[13])
    cat_feats_test.append(cat_instance)

# one-hot encode the categorical training features
enc = OneHotEncoder(handle_unknown='ignore')
X = numpy.array(cat_feats)
x_train_1hot = enc.fit_transform(X).toarray()

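How do I combine them into one full training set (x_train_full)? I have tried appending or concatenating the arrays, but I ran into a bunch of errors. I think I am fundamentally misunderstanding something?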

I would like to do this with scikit-learn or pure Python, and avoid pandas.

Edit: here is a sample of the training dataset (x_train):

[['39', ' State-gov', ' 77516', ' Bachelors', ' 13', ' Never-married', ' Adm-clerical', ' Not-in-family', ' White', ' Male', ' 2174', ' 0', ' 40', ' United-States'], ['50', ' Self-emp-not-inc', ' 83311', ' Bachelors', ' 13', ' Married-civ-spouse', ' Exec-managerial', ' Husband', ' White', ' Male', ' 0', ' 0', ' 13', ' United-States'], ['38', ' Private', ' 215646', ' HS-grad', ' 9', ' Divorced', ' Handlers-cleaners', ' Not-in-family', ' White', ' Male', ' 0', ' 0', ' 40', ' United-States'], ['53', ' Private', ' 234721', ' 11th', ' 7', ' Married-civ-spouse', ' Handlers-cleaners', ' Husband', ' Black', ' Male', ' 0', ' 0', ' 40', ' United-States'], ['28', ' Private', ' 338409', ' Bachelors', ' 13', ' Married-civ-spouse', ' Prof-specialty', ' Wife', ' Black', ' Female', ' 0', ' 0', ' 40', ' Cuba'], ['37', ' Private', ' 284582', ' Masters', ' 14', ' Married-civ-spouse', ' Exec-managerial', ' Wife', ' White', ' Female', ' 0', ' 0', ' 40', ' United-States'], ['49', ' Private', ' 160187', ' 9th', ' 5', ' Married-spouse-absent', ' Other-service', ' Not-in-family', ' Black', ' Female', ' 0', ' 0', ' 16', ' Jamaica'], ['52', ' Self-emp-not-inc', ' 209642', ' HS-grad', ' 9', ' Married-civ-spouse', ' Exec-managerial', ' Husband', ' White', ' Male', ' 0', ' 0', ' 45', ' United-States'], ['31', ' Private', ' 45781', ' Masters', ' 14', ' Never-married', ' Prof-specialty', ' Not-in-family', ' White', ' Female', ' 14084', ' 0', ' 50', ' United-States'], ['42', ' Private', ' 159449', ' Bachelors', ' 13', ' Married-civ-spouse', ' Exec-managerial', ' Husband', ' White', ' Male', ' 5178', ' 0', ' 40', ' United-States']]

The full original dataset can be found here: http://archive.ics.uci.edu/ml/datasets/Adult

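I notice that you did not convert x_train_num to int, so it still holds strings. Once you convert it, you should be able to concatenate like this: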

import numpy as np

x_train_num = np.array(x_train_num, dtype=int)
x_train = np.concatenate([x_train_num, x_train_1hot], axis=1)
print(x_train.shape)
# (10, 33)
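
The same idea can be carried over to the test set. Below is a minimal, self-contained sketch of that, not a definitive implementation. The names NUM_COLS, CAT_COLS and split_columns are hypothetical helpers introduced here for illustration; the column positions come from the loops in the question, and the rows are copied from the sample posted above (three used as training rows, one as a stand-in test row). The encoder is fitted on the training rows only and then reused with transform on the test rows, so both matrices end up with the same columns.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Column positions taken from the question's loops (hypothetical constant names).
NUM_COLS = [0, 2, 4, 10, 11, 12]
CAT_COLS = [1, 3, 5, 6, 7, 8, 9, 13]

# Rows copied from the sample in the question.
x_train = [
    ['39', ' State-gov', ' 77516', ' Bachelors', ' 13', ' Never-married', ' Adm-clerical', ' Not-in-family', ' White', ' Male', ' 2174', ' 0', ' 40', ' United-States'],
    ['50', ' Self-emp-not-inc', ' 83311', ' Bachelors', ' 13', ' Married-civ-spouse', ' Exec-managerial', ' Husband', ' White', ' Male', ' 0', ' 0', ' 13', ' United-States'],
    ['38', ' Private', ' 215646', ' HS-grad', ' 9', ' Divorced', ' Handlers-cleaners', ' Not-in-family', ' White', ' Male', ' 0', ' 0', ' 40', ' United-States'],
]
x_test = [
    ['28', ' Private', ' 338409', ' Bachelors', ' 13', ' Married-civ-spouse', ' Prof-specialty', ' Wife', ' Black', ' Female', ' 0', ' 0', ' 40', ' Cuba'],
]


def split_columns(rows):
    """Split raw string rows into an int array and a categorical string array."""
    # int() tolerates the leading spaces present in the raw values.
    num = np.array([[int(row[i]) for i in NUM_COLS] for row in rows], dtype=int)
    cat = np.array([[row[i] for i in CAT_COLS] for row in rows])
    return num, cat


x_train_num, x_train_cat = split_columns(x_train)
x_test_num, x_test_cat = split_columns(x_test)

# Fit the encoder on the training categories only; categories that appear
# only in the test rows are encoded as all zeros because of handle_unknown='ignore'.
enc = OneHotEncoder(handle_unknown='ignore')
x_train_1hot = enc.fit_transform(x_train_cat).toarray()
x_test_1hot = enc.transform(x_test_cat).toarray()

# Stack numeric and one-hot columns side by side.
x_train_full = np.hstack([x_train_num, x_train_1hot])
x_test_full = np.hstack([x_test_num, x_test_1hot])

print(x_train_full.shape, x_test_full.shape)
```

Both matrices have the same number of columns because the test rows go through the encoder that was already fitted on the training rows, rather than through a second fit.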
