scikit-learn: Is there a way to provide an object as input to a classifier's predict function?

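I plan to use SGDClassifier in production. The idea is to train the classifier on some training data, dump it to a .pkl file with cPickle, and reuse it later in a script. However, some high-cardinality fields are categorical in nature and get converted to a one-hot matrix representation, which creates about 5000 features. The input I get for prediction will have only one of these features set; the rest will all be 0. It will of course also include the other numeric features. From the documentation, it appears that the predict function expects an array of arrays as input. Is there any way to transform my input into the format predict expects without having to store the fields every time I train the model?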

Update

So, suppose my input contains 3 fields:

{
  rate: 10, // numeric
  flagged: 0, //binary 
  host: 'somehost.com' // keeping this categorical
}

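host can take about 5000 different values. Right now I load the file into a pandas DataFrame and use the get_dummies function to convert the host field into about 5000 new binary fields.

I then train the model and store it with cPickle.

Now, when I need to use the predict function, I only have 3 fields for the input (as shown above). However, as I understand it, predict will expect an array of vectors, and each vector should have about 5000 fields.

For the entry I need a prediction for, I know only one of those fields: the one corresponding to the value of host itself.

For example, if my input is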

{
  rate: 5,
  flagged: 1,
  host: 'new_host.com'
}

I know the fields the predictor expects should be:

{
  rate: 5,
  flagged: 1,
  new_host: 1
}

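But if I convert this to vector format, I don't know which index to put the new_host field at. I also don't know what the other hosts are (unless I store them somewhere during the training phase).

I hope this makes sense. Please let me know if I'm going about this the wrong way.

"I don't know which index to put the new_host field at"

An approach that has worked well for me is to build a pipeline and use it for both training and prediction. That way you don't have to care about the column indices of whatever output the transformations produce: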

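import pickle
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer

# in training
# caveat: some scikit-learn versions reject LabelBinarizer as a pipeline step
pipl = Pipeline(steps=[('binarizer', LabelBinarizer()),
                       ('clf', SGDClassifier())])
model = pipl.fit(X, Y)
pickle.dump(model, mf)  # mf: file object opened for binary writing
# in production
model = pickle.load(mf)  # mf: the same file, opened for binary reading
y = model.predict(X)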

As input for X and Y you need to pass an array-like object. Make sure the input has the same structure for training and testing, e.g.

X = [[data.get('rate'), data.get('flagged'), data.get('host')]] 
Y = [[y-cols]] # your example doesn't specify what is Y in your data
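
As a rough sketch of the same pipeline idea using only core scikit-learn, the block below swaps in ColumnTransformer and OneHotEncoder (not used in the answer above) so that only the host column is one-hot encoded. The toy rows, labels, and step names are invented for illustration, and handle_unknown='ignore' is an assumption about how unseen hosts should be treated.

```python
import pickle

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training rows in the order [rate, flagged, host].
# The values and labels are made up purely for illustration.
X_train = np.array([
    [10, 0, 'somehost.com'],
    [5, 1, 'otherhost.com'],
    [7, 0, 'somehost.com'],
    [1, 1, 'thirdhost.com'],
], dtype=object)
y_train = [0, 1, 0, 1]

preprocess = ColumnTransformer([
    # scale the two numeric columns
    ('num', StandardScaler(), [0, 1]),
    # one-hot encode host; a host unseen in training becomes an all-zero block
    ('host', OneHotEncoder(handle_unknown='ignore'), [2]),
])
pipl = Pipeline(steps=[('prep', preprocess),
                       ('clf', SGDClassifier(random_state=0))])
pipl.fit(X_train, y_train)

# Round-trip through pickle, as separate training and production scripts would.
restored = pickle.loads(pickle.dumps(pipl))

# A raw 3-field row is enough at prediction time, even for a new host.
X_new = np.array([[5, 1, 'new_host.com']], dtype=object)
prediction = restored.predict(X_new)
n_columns = restored.named_steps['prep'].transform(X_new).shape[1]
```

Because the fitted encoder travels inside the pickled pipeline, the production script never has to know which column a given host maps to.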

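More flexible: Pandas DataFrame + Pipeline

It also works nicely to combine a Pandas DataFrame with sklearn-pandas, because it lets you apply different transformations to different column names. For example: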

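import pandas as pd
import sklearn.preprocessing
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn_pandas import DataFrameMapper

df = pd.DataFrame.from_dict(data)
mapper = DataFrameMapper([
        ('host', sklearn.preprocessing.LabelBinarizer()),
        # a list selector passes a 2-D column, which StandardScaler requires
        (['rate'], sklearn.preprocessing.StandardScaler())
])
pipl = Pipeline(steps=[('mapper', mapper),
                       ('clf', SGDClassifier())])
X = df[x-cols]
y = df[y-col(s)]
pipl.fit(X, y)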

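Note that x-cols and y-col(s) are the lists of feature columns and target column(s), respectively.

You should use a scikit-learn transformer instead of get_dummies. In this case, LabelBinarizer makes sense. Given that LabelBinarizer doesn't work in a pipeline, this is one way to do what you want: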
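
For comparison, if get_dummies is kept as in the question, the column layout it produced during training has to be saved and reapplied at prediction time. The following is only a minimal sketch of that idea with invented data; train_columns is a hypothetical name for the list you would persist next to the model.

```python
import pandas as pd

# Hypothetical training frame; the values are invented for illustration.
train = pd.DataFrame({
    'rate': [10, 5, 7],
    'flagged': [0, 1, 0],
    'host': ['somehost.com', 'otherhost.com', 'somehost.com'],
})
X_train = pd.get_dummies(train, columns=['host'])
# Persist this list alongside the model: it fixes the column order.
train_columns = list(X_train.columns)

# At prediction time, encode the single row and align it to the training layout.
new_row = pd.DataFrame([{'rate': 5, 'flagged': 1, 'host': 'new_host.com'}])
X_new = pd.get_dummies(new_row, columns=['host'])
# Columns for unseen hosts are dropped; missing training columns are filled with 0.
X_new = X_new.reindex(columns=train_columns, fill_value=0)
```

This works, but it means carrying extra state by hand, which is what a fitted transformer does for you.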

import pickle
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import LabelBinarizer

binarizer = LabelBinarizer()
# fitting LabelBinarizer means it remembers all the columns it's seen
one_hot_data = binarizer.fit_transform(X_train[:, categorical_col])
# replace string column with one-hot representation
X_train = np.concatenate([np.delete(X_train, categorical_col, axis=1),
                          one_hot_data], axis=1)
clf = SGDClassifier()
clf.fit(X_train, y)
# f is a file object opened for binary writing
pickle.dump({'clf': clf, 'binarizer': binarizer}, f)

then at prediction time:

estimators = pickle.load(f)
clf = estimators['clf']
binarizer = estimators['binarizer']
one_hot_data = binarizer.transform(X_test[:, categorical_col])
X_test = np.concatenate([np.delete(X_test, categorical_col, axis=1),
                         one_hot_data], axis=1)
clf.predict(X_test)
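
To make the "which index" part concrete, here is a small sketch of what a fitted LabelBinarizer remembers. The host names are invented, and host_to_index is just an illustrative helper, not something the approach above requires.

```python
from sklearn.preprocessing import LabelBinarizer

binarizer = LabelBinarizer()
# Three distinct hosts, so the output has one column per host.
binarizer.fit(['somehost.com', 'otherhost.com', 'thirdhost.com'])

# classes_ holds the hosts in column order (sorted), so the index of any
# known host can be looked up instead of being tracked by hand.
host_to_index = {host: i for i, host in enumerate(binarizer.classes_)}

# A known host gets a 1 in its own column.
known = binarizer.transform(['somehost.com'])
# A host that was not seen during fit is encoded as a row of zeros.
unseen = binarizer.transform(['new_host.com'])
```

Note that with exactly two distinct classes LabelBinarizer emits a single column rather than two, which is worth keeping in mind when checking the resulting feature count.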
