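I am trying to introduce fault tolerance into my program by saving variables through checkpoints, and I am trying to do this with MonitoredTrainingSession. Here is my configuration: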
    import tensorflow as tf

    global_step = tf.Variable(10, trainable=False, name='global_step')
    x = tf.constant(2)

    with tf.device("/job:local/task:0"):
        y1 = tf.Variable(x + 300)

    with tf.device("/job:local/task:1"):
        y2 = tf.Variable(x**2)

    with tf.device("/job:local/task:2"):
        y3 = tf.Variable(5*x)

    with tf.device("/job:local/task:3"):
        y0 = tf.Variable(x - 66)
        y = y0 + y1 + y2 + y3

    model = tf.global_variables_initializer()
    saver = tf.train.Saver(sharded=True)

    chief = tf.train.ChiefSessionCreator(scaffold=None, master='grpc://localhost:2222', config=None, checkpoint_dir='/home/tensorflow/codes/checkpoints')
    summary_hook = tf.train.SummarySaverHook(save_steps=None, save_secs=10, output_dir='/home/tensorflow/codes/savepoints', summary_writer=None, scaffold=None, summary_op=tf.summary.tensor_summary(name="y", tensor=y))
    saver_hook = tf.train.CheckpointSaverHook(checkpoint_dir='/home/tensorflow/codes/checkpoints', save_secs=None, save_steps=True, saver=saver, checkpoint_basename='model.ckpt', scaffold=None)

    # with tf.train.MonitoredSession(session_creator=ChiefSessionCreator,hooks=[saver_hook, summary_hook]) as sess:
    with tf.train.MonitoredTrainingSession(master='grpc://localhost:2222', is_chief=True, checkpoint_dir='/home/tensorflow/codes/checkpoints',
        scaffold=None, hooks=[saver_hook,summary_hook], chief_only_hooks=None, save_checkpoint_secs=None, save_summaries_steps=True, config=None) as sess:

        while not sess.should_stop():
            sess.run(tf.global_variables_initializer())

        while not sess.should_stop():
            result = sess.run(y)
            print(result)
I get the following runtime error, which I have not been able to resolve:
    Traceback (most recent call last):
      File "add_1.py", line 39, in <module>
        sess.run(tf.global_variables_initializer())
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 1187, in global_variables_initializer
        return variables_initializer(global_variables())
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 1169, in variables_initializer
        return control_flow_ops.group(*[v.initializer for v in var_list], name=name)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2773, in group
        deps.append(_GroupControlDeps(dev, ops_on_device[dev]))
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2721, in _GroupControlDeps
        return no_op(name=name)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_control_flow_ops.py", line 186, in no_op
        result = _op_def_lib.apply_op("NoOp", name=name)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
        op_def=op_def)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2199, in create_op
        self._check_not_finalized()
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1925, in _check_not_finalized
        raise RuntimeError("Graph is finalized and cannot be modified.")
    RuntimeError: Graph is finalized and cannot be modified.
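The root cause of the error seems to be that MonitoredTrainingSession has already finalized (frozen) the graph, so your tf.global_variables_initializer() call can no longer modify it. As the traceback shows, every call to tf.global_variables_initializer() tries to create a new op, and a finalized graph rejects that.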
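The failure can be reproduced without any session at all. The following is a minimal sketch, not the asker's program: it finalizes a graph by hand (MonitoredTrainingSession does this for you when it starts) and then tries to add one more op. The variable names are illustrative only.

```python
import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    x = tf.constant(2)      # ops created before finalization are accepted
    graph.finalize()        # from here on the graph is read-only
    try:
        tf.constant(3)      # creating any new op now raises
        error_message = None
    except RuntimeError as err:
        error_message = str(err)

print(error_message)
```

Any op-creating call made inside the session loop, including tf.global_variables_initializer(), hits the same check.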
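That said, there are a few things worth looking at:
1) Why are you trying to initialize all variables repeatedly here?
    while not sess.should_stop():
        sess.run(tf.global_variables_initializer())
2) Some of your code seems to duplicate what MonitoredTrainingSession already includes, for example ChiefSessionCreator. Could you take another look at the code (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/monitored_session.py#L243) or search for example usages to see how MonitoredTrainingSession is meant to be used?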
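This is probably not recommended for your use case, but it is possible to unfinalize the graph (note that this is a private method, as the leading underscore indicates):
    sess.graph._unsafe_unfinalize()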
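If you want to initialize the graph inside a loop, you can create a new graph at the top of the loop:
    import tensorflow as tf
    tf.reset_default_graph()
    # as_default() returns a context manager, so it only takes effect inside a `with` block
    with tf.Graph().as_default():
        pass  # build the new graph here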
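As a rough illustration of the "new graph per iteration" idea, the sketch below builds a separate graph on each pass through a loop and finalizes it. It only shows that freezing one graph leaves the next one unaffected; it does not run a session, and the loop body is a placeholder rather than the asker's model.

```python
import tensorflow as tf

graphs = []
for i in range(3):
    # Each iteration builds its ops in a brand-new graph.
    with tf.Graph().as_default() as graph:
        y = tf.constant(i) * 2
        graph.finalize()  # freezing this graph does not affect the next one
    graphs.append(graph)

all_finalized = all(g.finalized for g in graphs)
distinct = len({id(g) for g in graphs}) == 3
print(all_finalized, distinct)
```

Rebuilding the graph on every iteration is expensive, so this is usually worth it only when the graph structure really changes between iterations.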
Since your goal is to use MonitoredTrainingSession to get checkpoints, the usage is much simpler than in your example:
    import tensorflow as tf

    global_step = tf.contrib.framework.get_or_create_global_step()
    x = tf.constant(2)
    y1 = x + 300
    y2 = x**2
    y3 = x * 5
    y0 = x - 66
    y = y0 + y1 + y2 + y3
    step = tf.assign_add(global_step, 1)

    with tf.train.MonitoredTrainingSession(checkpoint_dir='/tmp/checkpoints') as sess:
        while not sess.should_stop():
            result, i = sess.run([y, step])
            print(result, i)
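- The hooks for saving and restoring checkpoints are created for you by MonitoredTrainingSession.
- Passing save_checkpoint_secs changes how often checkpoints are written; the default is 10 minutes. I have found that a higher frequency is not worth it: saving a checkpoint is not free, so very frequent checkpointing ends up slowing training down.
- ChiefSessionCreator and the gRPC configuration are only needed for distributed runs (see here for an explanation of the concepts). The same applies to assigning ops to specific devices: make sure you really need it before using it, because it can slow things down if you are not careful.
- You don't need to wrap the results of operations on tensors in tf.Variable(): they are already tensors that sess.run can evaluate directly. A tf.Variable is only needed for state that has to persist between steps, such as global_step.
- You can pass save_summaries_steps to monitor training with TensorBoard, but by default summaries are written every 100 steps anyway.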
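The two parameters mentioned in the list can be tried out with a short, self-terminating run. This is a sketch under assumptions that are not part of the answer above: it goes through the tf.compat.v1 alias (available from TensorFlow 1.13 onward), writes to a temporary directory instead of /tmp/checkpoints, and adds a StopAtStepHook so the loop ends after five steps. The interval values are arbitrary.

```python
import tempfile

import tensorflow as tf

tf1 = tf.compat.v1  # assumption: TensorFlow >= 1.13, where this alias exists

checkpoint_dir = tempfile.mkdtemp()  # scratch directory for this sketch only

with tf.Graph().as_default():
    global_step = tf1.train.get_or_create_global_step()
    step = tf1.assign_add(global_step, 1)

    with tf1.train.MonitoredTrainingSession(
            checkpoint_dir=checkpoint_dir,
            save_checkpoint_secs=60,   # at most one timed checkpoint per minute
            save_summaries_steps=50,   # write summaries every 50 steps
            hooks=[tf1.train.StopAtStepHook(last_step=5)]) as sess:
        last = None
        while not sess.should_stop():
            last = sess.run(step)

# A checkpoint is also written when the session starts and when it ends.
latest = tf1.train.latest_checkpoint(checkpoint_dir)
print(last, latest)
```

Running the script a second time against the same checkpoint_dir would restore global_step from the latest checkpoint rather than starting from zero, which is the fault-tolerance behaviour the question is after.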