Error when using the grpc+mpi protocol in distributed TensorFlow

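I just compiled TensorFlow (master) with MPI support and now specify the "grpc+mpi" protocol in the tf.train.Server object. However, whenever I try to start the training process, exactly one worker fails with this error: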

F ./tensorflow/contrib/mpi/mpi_utils.h:47] Failed to convert worker name to MPI index: ps:0:0

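Each time I reproduce the error, a different worker is the one that fails to "convert". What looks very suspicious to me is that the message calls the name it cannot convert a "worker" name, when the name shown (ps:0:0) actually belongs to the parameter server.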

With the "standard" protocol, "grpc", the whole training process works fine.

Each worker, as well as the single parameter server, runs on its own dedicated machine (no machines are shared). The OpenMPI version is 2.1.1.

How would I go about debugging this? Unfortunately, I know very little about MPI.

Thanks

席

I ran into the same problem when using TensorFlow with MPI support. The cause was that I was not using mpirun to launch the training process.

For example, my training script is mpi_train.sh:

#!/bin/bash
# Choose this node's role in the cluster from its short hostname.
host=$(hostname -s)
if [[ $host == "node-1" ]]; then
    job_name=ps
    task_id=0
elif [[ $host == "node-2" ]]; then
    job_name=worker
    task_id=0
elif [[ $host == "node-3" ]]; then
    job_name=worker
    task_id=1
else
    echo "Unknown host: $host" >&2
    exit 1
fi
cd /test/models/inception || exit 1
# Start the Inception trainer with the MPI-enabled gRPC protocol.
bazel-bin/inception/imagenet_distributed_train \
    --batch_size=32 \
    --data_dir=/test/data/ILSVRC2012 \
    --job_name="${job_name}" \
    --task_id="${task_id}" \
    --ps_hosts=10.0.20.14:2276 \
    --worker_hosts=10.0.20.15:2276,10.0.20.16:2276 \
    --protocol=grpc+mpi \
    --max_steps=1020

Instead, I should have launched the training script with mpirun, so that every process starts as part of a single MPI job:

mpirun -host 10.0.0.14,10.0.0.15,10.0.0.16 /test/mpi_train.sh
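
To catch this mistake early, the script could check whether it was started by mpirun before launching the trainer. The following is only a minimal sketch, assuming Open MPI, which exports OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE to the processes it launches. The function names launched_under_mpirun and describe_launch are hypothetical helpers, not part of TensorFlow or Open MPI.

```shell
#!/bin/bash
# Hypothetical helper: succeeds (exit status 0) when the process appears to
# have been started by Open MPI's mpirun.
# Assumption: Open MPI sets OMPI_COMM_WORLD_RANK for every launched process.
launched_under_mpirun() {
    [ -n "${OMPI_COMM_WORLD_RANK:-}" ]
}

# Hypothetical helper: prints a one-line summary of how the script was started.
describe_launch() {
    if launched_under_mpirun; then
        echo "MPI rank ${OMPI_COMM_WORLD_RANK} of ${OMPI_COMM_WORLD_SIZE:-?} on $(hostname -s)"
    else
        echo "not launched by mpirun; --protocol=grpc+mpi is likely to fail"
    fi
}

describe_launch
```

Placed near the top of mpi_train.sh, a check like this could also exit with an error instead of just printing a message, so a direct launch without mpirun stops before the trainer starts.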
