INFO:tensorflow:Timed-out waiting for a checkpoint



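I'm running on a MacBook Pro with an older GPU (NVIDIA GeForce 9600M GT, 512 MB) on OS X 10.11.6 with CUDA 4.5. (TensorFlow requires CUDA 7.5 or later to use the GPU.)

I get this message when training a Magenta model in TensorFlow:

INFO:tensorflow:Timed-out waiting for a checkpoint.

Here is my command and its output.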

$ bazel run //magenta/models/melody_rnn:melody_rnn_train -- --config=attention_rnn --run_dir=/tmp/melody_rnn/logdir/run1 --sequence_example_file=/Users/davidlaxer/magenta/magenta/testdata/notesequences.tfrecord --hparams="batch_size=10,rnn_layer_sizes=[64,64]" --num_training_steps=20000 --eval
INFO: Found 1 target...
Target //magenta/models/melody_rnn:melody_rnn_train up-to-date:
bazel-bin/magenta/models/melody_rnn/melody_rnn_train
INFO: Elapsed time: 0.561s, Critical Path: 0.09s
INFO: Running command line: bazel-bin/magenta/models/melody_rnn/melody_rnn_train '--config=attention_rnn' '--run_dir=/tmp/melody_rnn/logdir/run1' '--sequence_example_file=/Users/davidlaxer/magenta/magenta/testdata/notesequences.tfrecord' '--hparams=batch_size=10,rnn_layer_sizes=[64,64]' '--num_training_steps=20000' --eval
INFO:tensorflow:hparams = {'rnn_layer_sizes': [64, 64], 'attn_length': 40, 'dropout_keep_prob': 0.5, 'batch_size': 10, 'clip_norm': 3, 'learning_rate': 0.001}
INFO:tensorflow:[<tf.Tensor 'ParseSingleSequenceExample/ParseSingleSequenceExample:0' shape=(?, 74) dtype=float32>, <tf.Tensor 'ParseSingleSequenceExample/ParseSingleSequenceExample:1' shape=(?,) dtype=int64>, <tf.Tensor 'strided_slice:0' shape=() dtype=int32>]
INFO:tensorflow:Train dir: /tmp/melody_rnn/logdir/run1/train
INFO:tensorflow:Eval dir: /tmp/melody_rnn/logdir/run1/eval
INFO:tensorflow:Counting records in /Users/davidlaxer/magenta/magenta/testdata/notesequences.tfrecord.
INFO:tensorflow:Total records: 46
INFO:tensorflow:Waiting for new checkpoint at /tmp/melody_rnn/logdir/run1/train
INFO:tensorflow:Timed-out waiting for a checkpoint.
David-Laxers-MacBook-Pro:magenta davidlaxer$ 

What is causing this error?

I also tried adjusting the timeouts:

$ bazel run //magenta/models/melody_rnn:melody_rnn_train -- --config=attention_rnn --run_dir=/tmp/melody_rnn/logdir/run1 --sequence_example_file=/Users/davidlaxer/magenta/magenta/testdata/notesequences.tfrecord --hparams="batch_size=10,rnn_layer_sizes=[64,64]" **--save_summaries_secs=10000 --save_interval_secs=10000** --num_training_steps=20000 --eval

I removed the --eval flag, and it started training the model:

$ ls -l /tmp/melody_rnn/logdir/run1/train/
total 11032
-rw-r--r--  1 davidlaxer  wheel      149 Jul 20 16:04 checkpoint
-rw-r--r--  1 davidlaxer  wheel  2438765 Jul 20 16:04 events.out.tfevents.1500591842.David-Laxers-MacBook-Pro.local
-rw-r--r--  1 davidlaxer  wheel  1300637 Jul 20 16:04 graph.pbtxt
-rw-r--r--  1 davidlaxer  wheel  1226008 Jul 20 16:04 model.ckpt-0.data-00000-of-00001
-rw-r--r--  1 davidlaxer  wheel     1727 Jul 20 16:04 model.ckpt-0.index
-rw-r--r--  1 davidlaxer  wheel   667410 Jul 20 16:04 model.ckpt-0.meta

$ bazel run //magenta/models/melody_rnn:melody_rnn_train -- --config=attention_rnn --run_dir=/tmp/melody_rnn/logdir/run1 --sequence_example_file=/Users/davidlaxer/magenta/magenta/testdata/notesequences.tfrecord --hparams="batch_size=10,rnn_layer_sizes=[64,64]" **--save_summaries_secs=10000 --save_interval_secs=10000** --num_training_steps=20000 
Killed non-responsive server process (pid=65119)
.
INFO: Found 1 target...
Target //magenta/models/melody_rnn:melody_rnn_train up-to-date:
bazel-bin/magenta/models/melody_rnn/melody_rnn_train
INFO: Elapsed time: 9.400s, Critical Path: 0.65s
INFO: Running command line: bazel-bin/magenta/models/melody_rnn/melody_rnn_train '--config=attention_rnn' '--run_dir=/tmp/melody_rnn/logdir/run1' '--sequence_example_file=/Users/davidlaxer/magenta/magenta/testdata/notesequences.tfrecord' '--hparams=batch_size=10,rnn_layer_sizes=[64,64]' '**--save_summaries_secs=10000' '--save_interval_secs=10000**' '--num_training_steps=20000'
INFO:tensorflow:hparams = {'rnn_layer_sizes': [64, 64], 'attn_length': 40, 'dropout_keep_prob': 0.5, 'batch_size': 10, 'clip_norm': 3, 'learning_rate': 0.001}
INFO:tensorflow:Counting records in /Users/davidlaxer/magenta/magenta/testdata/notesequences.tfrecord.
INFO:tensorflow:Total records: 46
INFO:tensorflow:[<tf.Tensor 'random_shuffle_queue_Dequeue:0' shape=(?, 74) dtype=float32>, <tf.Tensor 'random_shuffle_queue_Dequeue:1' shape=(?,) dtype=int64>, <tf.Tensor 'random_shuffle_queue_Dequeue:2' shape=() dtype=int32>]
INFO:tensorflow:Train dir: /tmp/melody_rnn/logdir/run1/train
INFO:tensorflow:Starting training loop...
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Error reported to Coordinator: <type 'exceptions.UnicodeDecodeError'>, 'utf8' codec can't decode byte 0xe0 in position 132: invalid continuation byte
INFO:tensorflow:Saving checkpoints for 0 into /tmp/melody_rnn/logdir/run1/train/model.ckpt.
Traceback (most recent call last):
File "/private/var/tmp/_bazel_davidlaxer/182280691ad889ad33cd20c0640dc2b1/execroot/magenta/bazel-out/local-opt/bin/magenta/models/melody_rnn/melody_rnn_train.runfiles/__main__/magenta/models/melody_rnn/melody_rnn_train.py", line 112, in <module>
console_entry_point()
File "/private/var/tmp/_bazel_davidlaxer/182280691ad889ad33cd20c0640dc2b1/execroot/magenta/bazel-out/local-opt/bin/magenta/models/melody_rnn/melody_rnn_train.runfiles/__main__/magenta/models/melody_rnn/melody_rnn_train.py", line 108, in console_entry_point
tf.app.run(main)
File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/private/var/tmp/_bazel_davidlaxer/182280691ad889ad33cd20c0640dc2b1/execroot/magenta/bazel-out/local-opt/bin/magenta/models/melody_rnn/melody_rnn_train.runfiles/__main__/magenta/models/melody_rnn/melody_rnn_train.py", line 104, in main
checkpoints_to_keep=FLAGS.num_checkpoints)
File "/private/var/tmp/_bazel_davidlaxer/182280691ad889ad33cd20c0640dc2b1/execroot/magenta/bazel-out/local-opt/bin/magenta/models/melody_rnn/melody_rnn_train.runfiles/__main__/magenta/models/shared/events_rnn_train.py", line 71, in run_training
save_summaries_steps=summary_frequency)
File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/contrib/training/python/training/training.py", line 530, in train
loss = session.run(train_op)
File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 521, in __exit__
self._close_internal(exception_type)
File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 556, in _close_internal
self._sess.close()
File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 791, in close
self._sess.close()
File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 888, in close
ignore_live_threads=True)
File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/training/queue_runner_impl.py", line 238, in _run
enqueue_callable()
File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1063, in _single_operation_run
target_list_as_strings, status, None)
File "/Users/davidlaxer/anaconda/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 465, in raise_exception_on_not_ok_status
compat.as_text(pywrap_tensorflow.TF_Message(status)),
File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/util/compat.py", line 84, in as_text
return bytes_or_text.decode(encoding)
File "/Users/davidlaxer/anaconda/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 132: invalid continuation byte
ERROR: Non-zero return code '1' from command: Process exited with status 1.

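When you specify --eval, you are running an evaluation job rather than training. The evaluation job waits for a checkpoint to appear in run_dir and exits if it does not find one. Run the training job first (without --eval) so that checkpoints exist for the evaluation job to pick up.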
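To make the wait-then-time-out behaviour concrete, here is a minimal sketch in plain Python. It is not TensorFlow's or Magenta's actual implementation: the helper name wait_for_checkpoint and its timeout_secs and poll_secs parameters are hypothetical, and it only assumes that a training run writes a file named "checkpoint" into the train directory, as the ls output in the question shows.

```python
import os
import time


def wait_for_checkpoint(train_dir, timeout_secs=5.0, poll_secs=0.5):
    """Poll train_dir for a 'checkpoint' index file.

    Hypothetical helper for illustration only; returns the path of the
    index file once it exists, or None if the timeout elapses first.
    """
    index_path = os.path.join(train_dir, "checkpoint")
    deadline = time.time() + timeout_secs
    while True:
        if os.path.isfile(index_path):
            return index_path
        if time.time() >= deadline:
            # Analogous to the "Timed-out waiting for a checkpoint." message.
            return None
        time.sleep(poll_secs)


if __name__ == "__main__":
    # Train directory taken from the log output in the question.
    train_dir = "/tmp/melody_rnn/logdir/run1/train"
    found = wait_for_checkpoint(train_dir, timeout_secs=2.0)
    if found is None:
        print("No checkpoint yet; start the training job (without --eval) first.")
    else:
        print("Checkpoint index found at", found)
```

Running this check before launching the job with --eval is one way to tell whether the training job has written anything yet.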
