TensorFlow distributed training hangs with "RecvTensor cancelled" warning



I am training a model with the TensorFlow 2.1.0 Estimator API in parameter-server distributed mode. The code that launches training is as follows:

# model_fn is defined with the model logic
GATNE = tf.estimator.Estimator(model_fn=model_fn, model_dir=args.model_dir, params=model_params, config=config)
# obtain different train and eval files for different workers;
# the files are stored on a shared HDFS
train_files, eval_files = get_files_for_worker()
# the input_fn raises an end-of-input exception after the given number of epochs
train_spec = tf.estimator.TrainSpec(
    input_fn=lambda: input_fn(train_files, batch_size=args.batch_size, num_epochs=args.epochs),
    max_steps=None)
eval_spec = tf.estimator.EvalSpec(
    input_fn=lambda: input_fn(eval_files, batch_size=args.batch_size, num_epochs=1),
    steps=1000,
    exporters=exporter,
    throttle_secs=300)
tf.estimator.train_and_evaluate(GATNE, train_spec, eval_spec)
print("Done.")

I use 2 ps servers and 5 workers (with indices ranging from 0 to 4) to train the model. Worker 0 is the chief, worker 1 is the evaluator, and the rest are regular workers.
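
Each task gets its role from the TF_CONFIG environment variable, set per process; the cluster definition looks roughly like the following sketch (the hostnames are placeholders, not my real ones):

import json
import os

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief": ["host-0:2222"],                                  # worker 0 acts as chief
        "worker": ["host-2:2222", "host-3:2222", "host-4:2222"],  # workers 2-4
        "ps": ["ps-0:2222", "ps-1:2222"],
    },
    # the task section differs per process; worker 1 runs with
    # {"type": "evaluator", "index": 0} and is not part of the cluster spec
    "task": {"type": "chief", "index": 0},
})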

After the training job started, worker 4 finished training successfully. Here are some of its logs (I added extra log lines such as ******************* [2021-04-01 12:05] global_step: 4331 ************************* during training to track the steps):

..............
2021-04-01 12:05:13.483 INFO:tensorflow:******************* [2021-04-01 12:05] global_step: 4331 *************************
2021-04-01 12:05:55.315 INFO:tensorflow:******************* [2021-04-01 12:05] global_step: 4535 *************************
2021-04-01 12:06:40.621 INFO:tensorflow:******************* [2021-04-01 12:06] global_step: 4733 *************************
2021-04-01 12:07:30.341 INFO:tensorflow:******************* [2021-04-01 12:07] global_step: 4936 *************************
2021-04-01 12:08:16.981 INFO:tensorflow:loss = 59.152977, step = 5136 (434.022 sec)
2021-04-01 12:08:16.981 INFO:tensorflow:******************* [2021-04-01 12:08] global_step: 5136 *************************
2021-04-01 12:09:04.206 INFO:tensorflow:******************* [2021-04-01 12:09] global_step: 5337 *************************
2021-04-01 12:09:51.85 INFO:tensorflow:******************* [2021-04-01 12:09] global_step: 5537 *************************
2021-04-01 12:10:38.679 INFO:tensorflow:******************* [2021-04-01 12:10] global_step: 5742 *************************
2021-04-01 12:11:25.781 INFO:tensorflow:******************* [2021-04-01 12:11] global_step: 5948 *************************
2021-04-01 12:12:07.822 INFO:tensorflow:******************* [2021-04-01 12:12] global_step: 6123 *************************
2021-04-01 12:12:54.258 INFO:tensorflow:******************* [2021-04-01 12:12] global_step: 6327 *************************
2021-04-01 12:13:41.683 INFO:tensorflow:******************* [2021-04-01 12:13] global_step: 6528 *************************
2021-04-01 12:14:21.60 INFO:tensorflow:Loss for final step: 62.46709.
2021-04-01 12:14:21.60 Done.

However, all the other workers get the warning W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842 and hang forever.

Some logs from worker 2:

......
2021-04-01 12:11:14.380 INFO:tensorflow:******************* [2021-04-01 12:11] global_step: 5899 *************************
2021-04-01 12:11:56.730 INFO:tensorflow:******************* [2021-04-01 12:11] global_step: 6077 *************************
2021-04-01 12:12:39.956 INFO:tensorflow:******************* [2021-04-01 12:12] global_step: 6269 *************************
2021-04-01 12:13:26.957 INFO:tensorflow:******************* [2021-04-01 12:13] global_step: 6468 *************************
2021-04-01 12:14:14.77 INFO:tensorflow:******************* [2021-04-01 12:14] global_step: 6668 *************************
2021-04-01 12:14:50.611 INFO:tensorflow:******************* [2021-04-01 12:14] global_step: 6821 *************************
2021-04-01 12:15:19.129 2021-04-01 12:15:19.128988: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.129 2021-04-01 12:15:19.129495: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.132 2021-04-01 12:15:19.132203: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.132 2021-04-01 12:15:19.132261: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.132 2021-04-01 12:15:19.132339: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.132 2021-04-01 12:15:19.132388: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842

Worker 3:

2021-04-01 12:14:22.586 INFO:tensorflow:loss = 60.513123, step = 6703 (462.406 sec)
2021-04-01 12:14:22.586 INFO:tensorflow:******************* [2021-04-01 12:14] global_step: 6703 *************************
2021-04-01 12:14:58.652 INFO:tensorflow:******************* [2021-04-01 12:14] global_step: 6855 *************************
2021-04-01 12:15:30.105 INFO:tensorflow:******************* [2021-04-01 12:15] global_step: 6984 *************************
2021-04-01 12:15:53.527 INFO:tensorflow:******************* [2021-04-01 12:15] global_step: 7084 *************************
2021-04-01 12:16:17.568 INFO:tensorflow:******************* [2021-04-01 12:16] global_step: 7184 *************************
2021-04-01 12:16:41.582 INFO:tensorflow:******************* [2021-04-01 12:16] global_step: 7284 *************************
2021-04-01 12:17:05.140 INFO:tensorflow:******************* [2021-04-01 12:17] global_step: 7384 *************************
2021-04-01 12:17:22.976 INFO:tensorflow:******************* [2021-04-01 12:17] global_step: 7485 *************************
2021-04-01 12:17:36.575 2021-04-01 12:17:36.574901: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363

These workers just hang there, and I don't know why. The ps server logs also show similar warnings:

ps server 0:

2021-04-01 11:48:01.233 2021-04-01 11:48:01.232972: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:2222
2021-04-01 12:14:20.957 2021-04-01 12:14:20.957449: I tensorflow/core/distributed_runtime/worker.cc:204] Cancellation requested for RunGraph.
2021-04-01 12:15:19.133 2021-04-01 12:15:19.132949: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.133 2021-04-01 12:15:19.133103: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.133 2021-04-01 12:15:19.133120: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.133 2021-04-01 12:15:19.133131: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.133 2021-04-01 12:15:19.133141: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.133 2021-04-01 12:15:19.133151: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.133 2021-04-01 12:15:19.133174: I tensorflow/core/distributed_runtime/worker.cc:204] Cancellation requested for RunGraph.
2021-04-01 12:17:35.837 2021-04-01 12:17:35.836897: I tensorflow/core/distributed_runtime/worker.cc:204] Cancellation requested for RunGraph.
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577149: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577221: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577233: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577312: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577323: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577330: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577339: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577348: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577355: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577363: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577369: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.579 2021-04-01 12:17:36.579436: I tensorflow/core/distributed_runtime/worker.cc:204] Cancellation requested for RunGraph.

ps server 1:

2021-04-01 11:46:28.368 2021-04-01 11:46:28.368034: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:2222
2021-04-01 12:14:20.955 2021-04-01 12:14:20.955100: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 42171233248859519
2021-04-01 12:14:20.958 2021-04-01 12:14:20.957943: I tensorflow/core/distributed_runtime/worker.cc:204] Cancellation requested for RunGraph.
2021-04-01 12:15:19.130 2021-04-01 12:15:19.130513: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.130 2021-04-01 12:15:19.130639: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.130 2021-04-01 12:15:19.130660: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.130 2021-04-01 12:15:19.130680: I tensorflow/core/distributed_runtime/worker.cc:204] Cancellation requested for RunGraph.
2021-04-01 12:15:19.134 2021-04-01 12:15:19.134413: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.134 2021-04-01 12:15:19.134734: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.134 2021-04-01 12:15:19.134767: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.134 2021-04-01 12:15:19.134853: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.134870: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.134908: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.134976: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.134993: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.135043: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.135084: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.135120: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.135133: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.135177: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.135212: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:17:35.837 2021-04-01 12:17:35.837269: I tensorflow/core/distributed_runtime/worker.cc:204] Cancellation requested for RunGraph.
2021-04-01 12:17:36.578 2021-04-01 12:17:36.578073: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.578 2021-04-01 12:17:36.578184: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.578 2021-04-01 12:17:36.578216: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.580 2021-04-01 12:17:36.580204: I tensorflow/core/distributed_runtime/worker.cc:204] Cancellation requested for RunGraph.

The logs of the chief and the evaluator look fine. I googled the RecvTensor cancelled warning but could not find enough information. Maybe it is a network problem? But the warning appears and the workers hang every time I launch the training job, so I don't think it happens by accident.

I have tried several times, and there is always exactly one worker that finishes the job successfully, after which the other workers hang.

Can anyone help me with this?

Update: I ran the job several times and found that one training worker always finishes first, and then all the other training workers start to hang. So I suspected that the worker that finishes training somehow affects the others. What if the first worker to finish training did not actually "finish"? So I added a 10-minute sleep after the training code on every training worker, so that a worker does not "finish" right away. That did the trick: while the first finished worker is sleeping, the other training workers keep training normally, and all of them complete within the 10 minutes. This is my workaround so far; I am still open to other solutions.
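
Roughly, the change on each training worker looks like this (10 minutes is an arbitrary value that just needs to cover the remaining training time of the slowest worker):

import time

tf.estimator.train_and_evaluate(GATNE, train_spec, eval_spec)
# keep the finished worker process alive for a while, so that its exit
# does not affect the workers that are still training
time.sleep(10 * 60)
print("Done.")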
