TensorFlow on multiple GPUs



Recently, I have been trying to learn how to use TensorFlow on multiple GPUs by reading the official tutorial. However, there is one point that confuses me. The following code is part of the official tutorial, and it calculates the loss on a single GPU.

def tower_loss(scope, images, labels):
  # Build inference Graph.
  logits = cifar10.inference(images)
  # Build the portion of the Graph calculating the losses. Note that we will
  # assemble the total_loss using a custom function below.
  _ = cifar10.loss(logits, labels)
  # Assemble all of the losses for the current tower only.
  losses = tf.get_collection('losses', scope)
  # Calculate the total loss for the current tower.
  total_loss = tf.add_n(losses, name='total_loss')
  # Attach a scalar summary to all individual losses and the total loss; do the
  # same for the averaged version of the losses.
  for l in losses + [total_loss]:
    # Remove 'tower_[0-9]/' from the name in case this is a multi-GPU training
    # session. This helps the clarity of presentation on tensorboard.
    loss_name = re.sub('%s_[0-9]*/' % cifar10.TOWER_NAME, '', l.op.name)
    tf.summary.scalar(loss_name, l)
  return total_loss

The training process is as follows.

def train():
  with tf.device('/cpu:0'):
    # Create a variable to count the number of train() calls. This equals the
    # number of batches processed * FLAGS.num_gpus.
    global_step = tf.get_variable(
        'global_step', [],
        initializer=tf.constant_initializer(0), trainable=False)
    # Calculate the learning rate schedule.
    num_batches_per_epoch = (cifar10.NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN /
                             FLAGS.batch_size / FLAGS.num_gpus)
    decay_steps = int(num_batches_per_epoch * cifar10.NUM_EPOCHS_PER_DECAY)
    # Decay the learning rate exponentially based on the number of steps.
    lr = tf.train.exponential_decay(cifar10.INITIAL_LEARNING_RATE,
                                    global_step,
                                    decay_steps,
                                    cifar10.LEARNING_RATE_DECAY_FACTOR,
                                    staircase=True)
    # Create an optimizer that performs gradient descent.
    opt = tf.train.GradientDescentOptimizer(lr)
    # Get images and labels for CIFAR-10.
    images, labels = cifar10.distorted_inputs()
    batch_queue = tf.contrib.slim.prefetch_queue.prefetch_queue(
        [images, labels], capacity=2 * FLAGS.num_gpus)
    # Calculate the gradients for each model tower.
    tower_grads = []
    with tf.variable_scope(tf.get_variable_scope()):
      for i in xrange(FLAGS.num_gpus):
        with tf.device('/gpu:%d' % i):
          with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
            # Dequeues one batch for the GPU
            image_batch, label_batch = batch_queue.dequeue()
            # Calculate the loss for one tower of the CIFAR model. This function
            # constructs the entire CIFAR model but shares the variables across
            # all towers.
            loss = tower_loss(scope, image_batch, label_batch)
            # Reuse variables for the next tower.
            tf.get_variable_scope().reuse_variables()
            # Retain the summaries from the final tower.
            summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope)

However, I am confused about the for loop `for i in xrange(FLAGS.num_gpus)`. It seems that I have to fetch a new batch of images from `batch_queue` and compute the gradients for each GPU in turn, so I think this process is serialized rather than run in parallel. Is there something wrong with my understanding? By the way, I could also use an iterator to feed images to my model instead of the dequeue, right?

Thank you all!

This is a common misconception about TensorFlow's coding model. What you are showing here is the construction of the computation graph, not its actual execution.

The block:

for i in xrange(FLAGS.num_gpus):
  with tf.device('/gpu:%d' % i):
    with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
      # Dequeues one batch for the GPU
      image_batch, label_batch = batch_queue.dequeue()
      # Calculate the loss for one tower of the CIFAR model. This function
      # constructs the entire CIFAR model but shares the variables across
      # all towers.
      loss = tower_loss(scope, image_batch, label_batch)

translates to:

For each GPU device (`for i in range..` & `with device...`):
- build operations needed to dequeue a batch
- build operations needed to run the batch through the network and compute the loss

Note how, via tf.get_variable_scope().reuse_variables(), you are telling the graph that the variables used by each GPU's graph have to be shared among all of them (i.e., all the graphs on the multiple devices "reuse" the same variables).
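Here is a minimal sketch of that sharing mechanism in isolation (TF 1.x API; the names tiny_tower and w are illustrative, not from the tutorial): the tower-building function is called twice, but the weight variable is created only once.

import tensorflow as tf

def tiny_tower(x):
  # get_variable either creates 'w' or, when reuse is on, returns the
  # already-existing variable with that name.
  w = tf.get_variable('w', shape=[4, 2])
  return tf.matmul(x, w)

x = tf.placeholder(tf.float32, [None, 4])
with tf.variable_scope('model'):
  out0 = tiny_tower(x)                       # creates model/w
  tf.get_variable_scope().reuse_variables()  # switch the scope to reuse mode
  out1 = tiny_tower(x)                       # reuses the same model/w

# Only one weight variable exists, even though the tower was built twice.
print([v.name for v in tf.global_variables()])  # ['model/w:0']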

None of this actually runs the network even once (note that there is no sess.run()): you are just giving instructions on how the data must flow.
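To see the construction/execution split on its own, here is a toy TF 1.x example unrelated to the tutorial:

import tensorflow as tf

# Construction phase: these lines only add nodes to the graph.
a = tf.constant(3.0)
b = tf.constant(4.0)
c = a * b
print(c)  # Tensor("mul:0", shape=(), dtype=float32) -- a graph node, not 12.0

# Execution phase: nothing is computed until a session runs the node.
with tf.Session() as sess:
  print(sess.run(c))  # 12.0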

Then, when you start the actual training (I guess you missed that piece of code when copying it here), each GPU will pull its own batch and produce the per-tower losses. I guess these losses are averaged somewhere in the subsequent code, and the average is the loss that gets passed to the optimizer.
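In the full tutorial this averaging is done by a helper called average_gradients, fed by appending opt.compute_gradients(loss) to tower_grads inside the tower loop. A simplified sketch of the idea:

def average_gradients(tower_grads):
  # tower_grads is a list with one entry per GPU; each entry is the list of
  # (gradient, variable) pairs returned by opt.compute_gradients() on that tower.
  average_grads = []
  # zip(*tower_grads) groups together the (grad, var) pairs that refer to the
  # same variable across all towers.
  for grad_and_vars in zip(*tower_grads):
    grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
    grad = tf.reduce_mean(tf.concat(grads, axis=0), 0)
    # The variables are shared across towers, so the copy from the first
    # tower is enough.
    average_grads.append((grad, grad_and_vars[0][1]))
  return average_grads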

Up until the point where the tower losses are averaged together, every device is independent of the others, so fetching the batch and computing the loss can happen in parallel. After that, the gradient computation and parameter update are done only once, the variables are updated, and the cycle repeats.
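Roughly, the piece missing from the code you copied looks like this after the tower loop (the name train_op is illustrative; the tutorial structures this slightly differently):

# Average the per-tower gradients and build a single update op. Running it
# once triggers all towers: each GPU dequeues its own batch and computes its
# gradients in parallel, then the averaged update is applied once.
grads = average_gradients(tower_grads)
train_op = opt.apply_gradients(grads, global_step=global_step)
# ... later, in the training loop:
#   sess.run(train_op)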

So, to answer your question: no, the per-batch loss computation is not serialized. But since this is synchronous distributed computation, you need to collect all the losses from all GPUs before being allowed to continue with the gradient computation and parameter update, so there is still a part of the graph that cannot be independent.
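As for the side question: yes, you can feed the towers with an iterator (e.g. tf.data) instead of the queue. Each get_next() call made during graph construction adds its own "fetch a batch" op, so every tower still pulls a different batch. A rough sketch, assuming a TF 1.x tf.data pipeline over in-memory images and labels (names are illustrative):

dataset = (tf.data.Dataset.from_tensor_slices((images, labels))
           .shuffle(buffer_size=10000)
           .batch(FLAGS.batch_size)
           .repeat())
iterator = dataset.make_one_shot_iterator()

tower_grads = []
with tf.variable_scope(tf.get_variable_scope()):
  for i in xrange(FLAGS.num_gpus):
    with tf.device('/gpu:%d' % i):
      with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
        # One get_next() op per tower, playing the role of dequeue().
        image_batch, label_batch = iterator.get_next()
        loss = tower_loss(scope, image_batch, label_batch)
        tf.get_variable_scope().reuse_variables()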
