我用 keras 和千层面实现了高速公路网络,而 keras 版本的表现一直不如千层面版本。我在两者中使用相同的数据集和元参数。以下是 keras 版本的代码:
X_train, y_train, X_test, y_test, X_all = hacking_script.load_all_data()
data_dim = 144
layer_count = 32
dropout = 0.04
hidden_units = 32
nb_epoch = 10
model = Sequential()
model.add(Dense(hidden_units, input_dim=data_dim))
model.add(Dropout(dropout))
for index in range(layer_count):
model.add(Highway(activation = 'relu'))
model.add(Dropout(dropout))
model.add(Dropout(dropout))
model.add(Dense(2, activation='softmax'))
print 'compiling...'
model.compile(loss='binary_crossentropy', optimizer='adagrad')
model.fit(X_train, y_train, batch_size=100, nb_epoch=nb_epoch,
show_accuracy=True, validation_data=(X_test, y_test), shuffle=True, verbose=0)
predictions = model.predict_proba(X_test)
这是千层面版本的代码:
class MultiplicativeGatingLayer(MergeLayer):
def __init__(self, gate, input1, input2, **kwargs):
incomings = [gate, input1, input2]
super(MultiplicativeGatingLayer, self).__init__(incomings, **kwargs)
assert gate.output_shape == input1.output_shape == input2.output_shape
def get_output_shape_for(self, input_shapes):
return input_shapes[0]
def get_output_for(self, inputs, **kwargs):
return inputs[0] * inputs[1] + (1 - inputs[0]) * inputs[2]
def highway_dense(incoming, Wh=Orthogonal(), bh=Constant(0.0),
Wt=Orthogonal(), bt=Constant(-4.0),
nonlinearity=rectify, **kwargs):
num_inputs = int(np.prod(incoming.output_shape[1:]))
l_h = DenseLayer(incoming, num_units=num_inputs, W=Wh, b=bh, nonlinearity=nonlinearity)
l_t = DenseLayer(incoming, num_units=num_inputs, W=Wt, b=bt, nonlinearity=sigmoid)
return MultiplicativeGatingLayer(gate=l_t, input1=l_h, input2=incoming)
# ==== Parameters ====
num_features = X_train.shape[1]
epochs = 10
hidden_layers = 32
hidden_units = 32
dropout_p = 0.04
# ==== Defining the neural network shape ====
l_in = InputLayer(shape=(None, num_features))
l_hidden1 = DenseLayer(l_in, num_units=hidden_units)
l_hidden2 = DropoutLayer(l_hidden1, p=dropout_p)
l_current = l_hidden2
for k in range(hidden_layers - 1):
l_current = highway_dense(l_current)
l_current = DropoutLayer(l_current, p=dropout_p)
l_dropout = DropoutLayer(l_current, p=dropout_p)
l_out = DenseLayer(l_dropout, num_units=2, nonlinearity=softmax)
# ==== Neural network definition ====
net1 = NeuralNet(layers=l_out,
update=adadelta, update_rho=0.95, update_learning_rate=1.0,
objective_loss_function=categorical_crossentropy,
train_split=TrainSplit(eval_size=0), verbose=0, max_epochs=1)
net1.fit(X_train, y_train)
predictions = net1.predict_proba(X_test)[:, 1]
现在,keras版本几乎没有优于逻辑回归,而千层面版本是迄今为止最好的评分算法。关于为什么的任何想法?
这里有一些建议(我不确定它们是否真的会缩小你观察到的性能差距):
根据 Keras 文档,高速公路层使用 Glorot 均匀权重初始化,而在 Lasagne 代码中,您使用的是正交权重初始化。除非您有另一部分代码将 Keras 高速公路层的权重初始化设置为正交,否则这可能是性能差距的根源。
看起来您正在将Adagrad用于Keras模型,但是您将Adadelta用于Lasagne模型。
我也不是 100% 确定这一点,但您可能还想验证您的变换偏差项是否以相同的方式初始化。