AssertionError: "assert not _is_device_list_single_worker(devices)" while using T



I am running a training job on GCP AI Platform for a TensorFlow Estimator that uses the MirroredStrategy distribution strategy, with --python-version 3.7 --runtime-version 2.1.

I have provided the relevant code snippets below:

SESS_CONFIG = tf.compat.v1.ConfigProto(
    allow_soft_placement=True,
    log_device_placement=False,
    intra_op_parallelism_threads=0,
    gpu_options=tf.compat.v1.GPUOptions(force_gpu_compatible=True))
config = tf.estimator.RunConfig(
    save_summary_steps=10,
    save_checkpoints_steps=20,
    session_config=SESS_CONFIG,
    keep_checkpoint_max=5,
    log_step_count_steps=100,
    train_distribute=tf.distribute.MirroredStrategy(),  # Distribution Strategy
    eval_distribute=tf.distribute.MirroredStrategy(),  # Distribution Strategy
    experimental_max_worker_delay_secs=None)
# -----------
custom_estimator_model = tf.estimator.Estimator(
    model_fn=model_fn(), model_dir=model_dir,
    config=config)
train_spec = tf.estimator.TrainSpec(input_fn=input_fn,
                                    max_steps=train_steps)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn,
                                  steps=eval_steps,
                                  exporters=exporters,
                                  throttle_secs=eval_throttle_secs)
tf.estimator.train_and_evaluate(custom_estimator_model,
                                train_spec,
                                eval_spec)

Configuration: the config.yaml used:

trainingInput:
  masterType: complex_model_m_gpu
  scaleTier: CUSTOM

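This code worked on AI Platform with TensorFlow 1.14 and Python 3.5, where the strategy was passed to RunConfig() as train_distribute=tf.contrib.distribute.MirroredStrategy(). After the upgrade to TF2, I changed it to train_distribute=tf.distribute.MirroredStrategy(). Since that change, the job fails with the following error: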

Error:

Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 239, in <module>
    main()
  File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 235, in main
    model_dir=model_dir)
  File "/root/.local/lib/python3.7/site-packages/trainer/models/models.py", line 244, in train_from_scratch
    self.train_estimator(model_dir)
  File "/root/.local/lib/python3.7/site-packages/trainer/models/models.py", line 234, in train_estimator
    eval_spec)
  File "/root/.local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 463, in train_and_evaluate
    _TrainingExecutor)
  File "/root/.local/lib/python3.7/site-packages/tensorflow/python/distribute/estimator_training.py", line 290, in train_and_evaluate
    session_config=run_config.session_config)
  File "/root/.local/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_coordinator.py", line 836, in run_distribute_coordinator
    task_type, task_id)
  File "/root/.local/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_coordinator.py", line 548, in _configure_session_config_for_std_servers
    task_id=task_id)
  File "/root/.local/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1127, in configure
    session_config, cluster_spec, task_type, task_id)
  File "/root/.local/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 788, in _configure
    self._initialize_multi_worker(multi_worker_devices)
  File "/root/.local/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 510, in _initialize_multi_worker
    device_dict = _group_device_list(devices)
  File "/root/.local/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 265, in _group_device_list
    assert not _is_device_list_single_worker(devices)
AssertionError

I ran into this as well when upgrading from runtime 1.5 to 2.3 with a mirrored strategy.

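It looks like the default configuration for the master changed (see master versus chief), which causes TF to fail with this error when it tries to configure itself for a single worker. For us, the solution was to unset TF_CONFIG (del os.environ["TF_CONFIG"]), which made it fall back to the previous behavior.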
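A minimal sketch of that workaround, assuming it runs at the top of the trainer entry point before any tf.distribute.MirroredStrategy() or RunConfig is created. The helper name pop_tf_config is hypothetical and not part of TensorFlow. It uses pop() rather than del so that it does not raise when TF_CONFIG is absent, and it returns the removed value so it can be logged for debugging.

```python
import json
import os


def pop_tf_config(environ=os.environ):
    """Remove TF_CONFIG from the environment (hypothetical helper).

    Returns the parsed JSON content, the raw string if it is not valid
    JSON, or None if TF_CONFIG was not set.
    """
    raw = environ.pop("TF_CONFIG", None)
    if raw is None:
        return None
    try:
        return json.loads(raw)
    except ValueError:
        # Not valid JSON; hand back the raw string unchanged.
        return raw


# Assumption: this is called before the strategy and RunConfig are built,
# so TensorFlow no longer sees a cluster definition in the environment.
removed = pop_tf_config()
print("Removed TF_CONFIG:", removed)
```

Printing the removed value once in the job logs shows which cluster and task entries the platform had set, which may help confirm whether the master/chief naming is what triggers the assertion in your case.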
