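OOM when allocating tensor

How can I resolve ResourceExhaustedError: OOM when allocating tensor?

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[10000,32,28,28]

I have included almost all of the code: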

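# assumed setup, not shown in the original post (TensorFlow 1.x API)
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

learning_rate = 0.0001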
epochs = 10
batch_size = 50
# declare the training data placeholders
# input x - for 28 x 28 pixels = 784 - this is the flattened image data that is drawn from
# mnist.train.next_batch()
x = tf.placeholder(tf.float32, [None, 784])
# dynamically reshape the input
x_shaped = tf.reshape(x, [-1, 28, 28, 1])
# now declare the output data placeholder - 10 digits
y = tf.placeholder(tf.float32, [None, 10])
def create_new_conv_layer(input_data, num_input_channels, num_filters, filter_shape, pool_shape, name):
    # setup the filter input shape for tf.nn.conv2d
    conv_filt_shape = [filter_shape[0], filter_shape[1], num_input_channels,
                      num_filters]
    # initialise weights and bias for the filter
    weights = tf.Variable(tf.truncated_normal(conv_filt_shape, stddev=0.03),
                                      name=name+'_W')
    bias = tf.Variable(tf.truncated_normal([num_filters]), name=name+'_b')
    # setup the convolutional layer operation
    out_layer = tf.nn.conv2d(input_data, weights, [1, 1, 1, 1], padding='SAME')
    # add the bias
    out_layer += bias
    # apply a ReLU non-linear activation
    out_layer = tf.nn.relu(out_layer)
    # now perform max pooling, with the window size taken from pool_shape
    ksize = [1, pool_shape[0], pool_shape[1], 1]
    strides = [1, 2, 2, 1]
    out_layer = tf.nn.max_pool(out_layer, ksize=ksize, strides=strides,
                               padding='SAME')
    return out_layer
# create some convolutional layers
layer1 = create_new_conv_layer(x_shaped, 1, 32, [5, 5], [2, 2], name='layer1')
layer2 = create_new_conv_layer(layer1, 32, 64, [5, 5], [2, 2], name='layer2')
flattened = tf.reshape(layer2, [-1, 7 * 7 * 64])
# setup some weights and bias values for this layer, then activate with ReLU
wd1 = tf.Variable(tf.truncated_normal([7 * 7 * 64, 1000], stddev=0.03), name='wd1')
bd1 = tf.Variable(tf.truncated_normal([1000], stddev=0.01), name='bd1')
dense_layer1 = tf.matmul(flattened, wd1) + bd1
dense_layer1 = tf.nn.relu(dense_layer1)
# another layer with softmax activations
wd2 = tf.Variable(tf.truncated_normal([1000, 10], stddev=0.03), name='wd2')
bd2 = tf.Variable(tf.truncated_normal([10], stddev=0.01), name='bd2')
dense_layer2 = tf.matmul(dense_layer1, wd2) + bd2
y_ = tf.nn.softmax(dense_layer2)
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=dense_layer2, labels=y))

# add an optimiser
optimiser = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cross_entropy)
# define an accuracy assessment operation
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
# setup the initialisation operator
init_op = tf.global_variables_initializer() 

with tf.Session() as sess:
    # initialise the variables
    sess.run(init_op)
    total_batch = int(len(mnist.train.labels) / batch_size)
    for epoch in range(epochs):
        avg_cost = 0
        for i in range(total_batch):
            batch_x, batch_y = mnist.train.next_batch(batch_size=batch_size)
            _, c = sess.run([optimiser, cross_entropy],
                            feed_dict={x: batch_x, y: batch_y})
            avg_cost += c / total_batch
        # evaluates the whole test set in a single run
        test_acc = sess.run(accuracy,
                            feed_dict={x: mnist.test.images, y: mnist.test.labels})
        print("Epoch:", (epoch + 1), "cost =", "{:.3f}".format(avg_cost),
              " test accuracy: {:.3f}".format(test_acc))
    print("\nTraining complete!")
    print(sess.run(accuracy, feed_dict={x: mnist.test.images, y: mnist.test.labels}))

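The lines referenced in the error are in the create_new_conv_layer function and the sess.run call in the training loop.

More of the error output, copied from the debugger, follows (there are more lines, but I think these are the main ones and the others are caused by them).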

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[10000,32,28,28] Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"]

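When I run it a second time, it produces the errors below. I have both a CPU and a GPU, as the output shows. I understand that some of the messages relate to the CPU, probably because my TensorFlow build was not compiled to use those instructions. I have CUDA 8 and cuDNN 6, Python 3.5, and TensorFlow 1.3.0 installed on Windows 10.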

2017-10-03 03:53:58.944371: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-03 03:53:58.945563: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-03 03:53:59.230761: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:955] Found device 0 with properties: name: Quadro K620 major: 5 minor: 0 memoryClockRate (GHz) 1.124 pciBusID 0000:01:00.0 Total memory: 2.00GiB Free memory: 1.66GiB
2017-10-03 03:53:59.231109: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:976] DMA: 0
2017-10-03 03:53:59.231229: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:986] 0: Y
2017-10-03 03:53:59.231363: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro K620, pci bus id: 0000:01:00.0)
2017-10-03 03:54:01.511141: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\stream_executor\cuda\cuda_dnn.cc:371] could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2017-10-03 03:54:01.511372: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\stream_executor\cuda\cuda_dnn.cc:375] error retrieving driver version: Unimplemented: kernel reported driver version not implemented on Windows
2017-10-03 03:54:01.511862: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\stream_executor\cuda\cuda_dnn.cc:338] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2017-10-03 03:54:01.512074: F C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\kernels\conv_ops.cc:672] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms)

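The process fails because you push the entire test set through the network at once for evaluation (see this question). The first convolution output for 10000 test images has shape [10000,32,28,28], and 10000 * 32 * 28 * 28 * 4 bytes is about 1 GB, while your GPU has only 1.66 GiB free in total, most of which is already taken by the network itself.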
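The arithmetic can be checked with a few lines of plain Python. This is only a sketch: the helper name tensor_bytes is made up for illustration, and 4 bytes per element assumes float32, as in the error message (DT_FLOAT).

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor (hypothetical helper; float32 by default)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


# Output of the first conv layer for the whole test set, as in the error message.
full_test_set = tensor_bytes([10000, 32, 28, 28])
# The same activation for one training batch of 50 images.
one_batch = tensor_bytes([50, 32, 28, 28])

print(full_test_set)                              # 1003520000
print("{:.2f} GiB".format(full_test_set / 2**30))
print("{:.2f} MiB".format(one_batch / 2**20))
```

A single activation for the full test set is about 200 times larger than the same activation for a training batch, which is why training runs but evaluation does not.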

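The solution is to feed the network in batches not only for training but also for testing. The resulting accuracy is the average over all batches (weighted by batch size if the batches are not all the same size). Moreover, you don't need to do this after every epoch: are you really interested in the test results of all the intermediate networks?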
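One way to do this is sketched below. The helper batched_accuracy and its parameter run_accuracy are hypothetical names, not part of TensorFlow: run_accuracy stands for any callable that evaluates the accuracy op on one chunk, and the chunk size of 500 is an arbitrary choice that should be tuned to the available GPU memory.

```python
def batched_accuracy(run_accuracy, images, labels, batch_size=500):
    """Evaluate accuracy chunk by chunk and return the example-weighted mean.

    run_accuracy(batch_images, batch_labels) must return the accuracy
    of a single chunk as a number between 0 and 1.
    """
    total = len(images)
    if total == 0:
        raise ValueError("empty evaluation set")
    weighted_sum = 0.0
    for start in range(0, total, batch_size):
        end = min(start + batch_size, total)
        batch_acc = run_accuracy(images[start:end], labels[start:end])
        # Weight by chunk size so a smaller last chunk does not skew the mean.
        weighted_sum += float(batch_acc) * (end - start)
    return weighted_sum / total


# Assumed usage with the graph from the question (inside the tf.Session block):
# run_accuracy = lambda bx, by: sess.run(accuracy, feed_dict={x: bx, y: by})
# test_acc = batched_accuracy(run_accuracy, mnist.test.images, mnist.test.labels)
```

With 500 images per chunk, the largest activation is 20 times smaller than for the full test set, and calling the helper once after training instead of after every epoch reduces the evaluation cost further.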

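Your second error message is most likely a consequence of the earlier failure, since the cuDNN driver no longer seems to work. I suggest restarting your machine.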
