TensorFlow-Slim Multi-GPU training

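I am using TensorFlow-Slim. My goal is to run the provided standard scripts (located in /models/slim/scripts) in multi-GPU mode. I have tested the finetune_resnet_v1_50_on_flowers.sh script (cloned on April 12, 2017). I only added --num_clones=2 at the end of the training section (inspired by /slim/deployment/model_deploy_test.py and previous StackOverflow answers):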

python train_image_classifier.py \
  --train_dir=${TRAIN_DIR} \
  --dataset_name=flowers \
  --dataset_split_name=train \
  --dataset_dir=${DATASET_DIR} \
  --model_name=resnet_v1_50 \
  --checkpoint_path=${PRETRAINED_CHECKPOINT_DIR}/resnet_v1_50.ckpt \
  --checkpoint_exclude_scopes=resnet_v1_50/logits \
  --trainable_scopes=resnet_v1_50/logits \
  --max_number_of_steps=3000 \
  --batch_size=32 \
  --learning_rate=0.01 \
  --save_interval_secs=60 \
  --save_summaries_secs=60 \
  --log_every_n_steps=100 \
  --optimizer=rmsprop \
  --weight_decay=0.00004 \
  --num_clones=2

Code from deployment/model_deploy_test.py:

def testMultiGPU(self):
    deploy_config = model_deploy.DeploymentConfig(num_clones=2)

I get a warning ("Ignoring device specification"):

I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:85:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla P100-SXM2-16GB, pci bus id: 0000:86:00.0)
I tensorflow/core/common_runtime/simple_placer.cc:669] Ignoring device specification /GPU:1 for node 'clone_1/fifo_queue_Dequeue' because the input edge from 'prefetch_queue/fifo_queue' is a reference connection and already has a device field set to /CPU:0
I tensorflow/core/common_runtime/simple_placer.cc:669] Ignoring device specification /GPU:0 for node 'clone_0/fifo_queue_Dequeue' because the input edge from 'prefetch_queue/fifo_queue' is a reference connection and already has a device field set to /CPU:0

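Both GPUs are active (judging by memory usage and GPU utilization), but training is no faster than on a single GPU.

This issue may be related to: https://github.com/tensorflow/tensorflow/issues/8061

I would appreciate any answers, comments, or concrete suggestions on this problem.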

CUDA version: release 8.0, V8.0.53

TensorFlow installed from binary; tested versions: 1.0.1 and 1.1.0rc

GPU: NVIDIA Tesla P100 (SXM2)

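Please see this issue: https://github.com/tensorflow/tensorflow/issues/12689. To make sure the variables are stored on the CPU, we need to use the following context manager:

with slim.arg_scope([slim.model_variable, slim.variable], device='/cpu:0'):

It solved my problem.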

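This answer may be late, but training is not supposed to get faster when measured in seconds per step. Instead, a second copy of the model is created, so each parameter update sees an effective batch size of 64 (2 clones × 32). You can therefore halve the maximum number of steps.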
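The arithmetic behind this can be sketched in a few lines of plain Python. This is only an illustration, assuming that --batch_size is applied per clone, as the answer above describes. The helper names and the 0.5 seconds-per-step timing are hypothetical and are not part of TF-Slim or measured from the question.

```python
def effective_batch_size(batch_size, num_clones):
    """Samples contributing to one parameter update across all clones."""
    return batch_size * num_clones


def equivalent_steps(single_gpu_steps, num_clones):
    """Steps needed to see the same number of samples (rounded up)."""
    return -(-single_gpu_steps // num_clones)


def samples_per_second(batch_size, num_clones, seconds_per_step):
    """Throughput, a fairer comparison metric than seconds per step."""
    return effective_batch_size(batch_size, num_clones) / seconds_per_step


# Values taken from the command in the question.
batch_size, num_clones, max_steps = 32, 2, 3000

print(effective_batch_size(batch_size, num_clones))  # 64
print(equivalent_steps(max_steps, num_clones))       # 1500

# Hypothetical timing: with the same 0.5 s per step, two clones still
# process twice as many samples per second as one clone.
print(samples_per_second(batch_size, 1, 0.5))           # 64.0
print(samples_per_second(batch_size, num_clones, 0.5))  # 128.0
```

Under these assumptions, a fair comparison between the single-GPU and two-GPU runs looks at samples per second, or at wall-clock time to reach 1500 steps versus 3000 steps, rather than at seconds per step.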
