Tensor had NaN values - TensorFlow Faster R-CNN training error



This is the cmd output I get. The number of steps completed before the error varies, but it is always fewer than 20.

C:UserseduptDocumentsGitHubProject>python object_detection/train.py  --logtostderr  --train_dir=train  --pipeline_config_path=faster_rcnn_resnet101.config
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
WARNING:tensorflow:From C:UserseduptDocumentsGitHubProjectobject_detectiontrainer.py:176: create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.create_global_step
WARNING:tensorflow:From C:UserseduptDocumentsGitHubProjectobject_detectioncorepreprocessor.py:1922: calling squeeze (from tensorflow.python.ops.array_ops) with squeeze_dims is deprecated and will be removed in a future version.
Instructions for updating:
Use the `axis` argument instead
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
WARNING:tensorflow:From C:UserseduptDocumentsGitHubProjectobject_detectioncorebox_predictor.py:371: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:From C:UserseduptDocumentsGitHubProjectobject_detectioncorelosses.py:269: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.
See @{tf.nn.softmax_cross_entropy_with_logits_v2}.
WARNING:tensorflow:From C:UserseduptDocumentsGitHubProjectobject_detectionbuildersoptimizer_builder.py:105: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
INFO:tensorflow:Summary name Learning Rate is illegal; using Learning_Rate instead.
INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead.
C:UserseduptAppDataLocalProgramsPythonPython36libsite-packagestensorflowpythonopsgradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
WARNING:tensorflow:From C:UserseduptAppDataLocalProgramsPythonPython36libsite-packagestensorflowcontribslimpythonslimlearning.py:737: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-10-01 21:26:27.032708: I T:srcgithubtensorflowtensorflowcoreplatformcpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-10-01 21:26:27.350000: I T:srcgithubtensorflowtensorflowcorecommon_runtimegpugpu_device.cc:1392] Found device 0 with properties:
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.7085
pciBusID: 0000:01:00.0
totalMemory: 6.00GiB freeMemory: 4.96GiB
2018-10-01 21:26:27.356938: I T:srcgithubtensorflowtensorflowcorecommon_runtimegpugpu_device.cc:1471] Adding visible gpu devices: 0
2018-10-01 21:26:29.297942: I T:srcgithubtensorflowtensorflowcorecommon_runtimegpugpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-01 21:26:29.300781: I T:srcgithubtensorflowtensorflowcorecommon_runtimegpugpu_device.cc:958]      0
2018-10-01 21:26:29.302611: I T:srcgithubtensorflowtensorflowcorecommon_runtimegpugpu_device.cc:971] 0:   N
2018-10-01 21:26:29.305150: I T:srcgithubtensorflowtensorflowcorecommon_runtimegpugpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4726 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from trainmodel.ckpt-0
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path trainmodel.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 1: loss = 4.7042 (19.631 sec/step)
INFO:tensorflow:global step 2: loss = 4.7257 (0.878 sec/step)
INFO:tensorflow:global step 3: loss = 4.4725 (0.851 sec/step)
INFO:tensorflow:global step 4: loss = 4.2467 (0.832 sec/step)
INFO:tensorflow:global step 5: loss = 4.0482 (0.922 sec/step)
INFO:tensorflow:global step 6: loss = 3.8669 (0.647 sec/step)
INFO:tensorflow:global step 7: loss = 3.7094 (0.731 sec/step)
INFO:tensorflow:global step 8: loss = 3.2892 (0.629 sec/step)
INFO:tensorflow:global step 9: loss = 3.6964 (0.608 sec/step)
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, LossTensor is inf or nan. : Tensor had NaN values
[[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](clone_loss/_3493)]]
Caused by op 'CheckNumerics', defined at:
File "object_detection/train.py", line 198, in <module>
tf.app.run()
File "C:UserseduptAppDataLocalProgramsPythonPython36libsite-packagestensorflowpythonplatformapp.py", line 125, in run
_sys.exit(main(argv))
File "object_detection/train.py", line 194, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "C:UserseduptDocumentsGitHubProjectobject_detectiontrainer.py", line 227, in train
total_loss = tf.check_numerics(total_loss, 'LossTensor is inf or nan.')
File "C:UserseduptAppDataLocalProgramsPythonPython36libsite-packagestensorflowpythonopsgen_array_ops.py", line 968, in check_numerics
"CheckNumerics", tensor=tensor, message=message, name=name)
File "C:UserseduptAppDataLocalProgramsPythonPython36libsite-packagestensorflowpythonframeworkop_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "C:UserseduptAppDataLocalProgramsPythonPython36libsite-packagestensorflowpythonframeworkops.py", line 3414, in create_op
op_def=op_def)
File "C:UserseduptAppDataLocalProgramsPythonPython36libsite-packagestensorflowpythonframeworkops.py", line 1740, in __init__
self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): LossTensor is inf or nan. : Tensor had NaN values
[[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](clone_loss/_3493)]]
Traceback (most recent call last):
File "C:UserseduptAppDataLocalProgramsPythonPython36libsite-packagestensorflowpythonclientsession.py", line 1322, in _do_call
return fn(*args)
File "C:UserseduptAppDataLocalProgramsPythonPython36libsite-packagestensorflowpythonclientsession.py", line 1307, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "C:UserseduptAppDataLocalProgramsPythonPython36libsite-packagestensorflowpythonclientsession.py", line 1409, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: LossTensor is inf or nan. : Tensor had NaN values
[[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](clone_loss/_3493)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "object_detection/train.py", line 198, in <module>
tf.app.run()
File "C:UserseduptAppDataLocalProgramsPythonPython36libsite-packagestensorflowpythonplatformapp.py", line 125, in run
_sys.exit(main(argv))
File "object_detection/train.py", line 194, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "C:UserseduptDocumentsGitHubProjectobject_detectiontrainer.py", line 296, in train
saver=saver)
File "C:UserseduptAppDataLocalProgramsPythonPython36libsite-packagestensorflowcontribslimpythonslimlearning.py", line 770, in train
sess, train_op, global_step, train_step_kwargs)
File "C:UserseduptAppDataLocalProgramsPythonPython36libsite-packagestensorflowcontribslimpythonslimlearning.py", line 487, in train_step
run_metadata=run_metadata)
File "C:UserseduptAppDataLocalProgramsPythonPython36libsite-packagestensorflowpythonclientsession.py", line 900, in run
run_metadata_ptr)
File "C:UserseduptAppDataLocalProgramsPythonPython36libsite-packagestensorflowpythonclientsession.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "C:UserseduptAppDataLocalProgramsPythonPython36libsite-packagestensorflowpythonclientsession.py", line 1316, in _do_run
run_metadata)
File "C:UserseduptAppDataLocalProgramsPythonPython36libsite-packagestensorflowpythonclientsession.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: LossTensor is inf or nan. : Tensor had NaN values
[[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](clone_loss/_3493)]]
Caused by op 'CheckNumerics', defined at:
File "object_detection/train.py", line 198, in <module>
tf.app.run()
File "C:UserseduptAppDataLocalProgramsPythonPython36libsite-packagestensorflowpythonplatformapp.py", line 125, in run
_sys.exit(main(argv))
File "object_detection/train.py", line 194, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "C:UserseduptDocumentsGitHubLEGO-ID-Projectobject_detectiontrainer.py", line 227, in train
total_loss = tf.check_numerics(total_loss, 'LossTensor is inf or nan.')
File "C:UserseduptAppDataLocalProgramsPythonPython36libsite-packagestensorflowpythonopsgen_array_ops.py", line 968, in check_numerics
"CheckNumerics", tensor=tensor, message=message, name=name)
File "C:UserseduptAppDataLocalProgramsPythonPython36libsite-packagestensorflowpythonframeworkop_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "C:UserseduptAppDataLocalProgramsPythonPython36libsite-packagestensorflowpythonframeworkops.py", line 3414, in create_op
op_def=op_def)
File "C:UserseduptAppDataLocalProgramsPythonPython36libsite-packagestensorflowpythonframeworkops.py", line 1740, in __init__
self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): LossTensor is inf or nan. : Tensor had NaN values
[[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](clone_loss/_3493)]]

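I tried lowering the learning rate and increasing the batch size, but neither helped. I don't think the problem is my annotations, because I error-checked them. I have also tried many other fixes suggested by people with similar errors, without success.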

I just came across this because I ran into the same problem. I fixed it by changing generate_tfrecord.py as follows:

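for index, row in group.object.iterrows():
    # Skip boxes whose min coordinate is not strictly below the max.
    # (Dividing both sides by a positive width/height does not change the
    # comparison, so the raw pixel values are compared directly.)
    if row['xmin'] >= row['xmax'] or row['ymin'] >= row['ymax']:
        continue
    # Store the remaining boxes normalized to the [0, 1] range.
    xmins.append(row['xmin'] / width)
    xmaxs.append(row['xmax'] / width)
    ymins.append(row['ymin'] / height)
    ymaxs.append(row['ymax'] / height)
    classes_text.append(row['class'].encode('utf8'))
    classes.append(class_text_to_int(row['class']))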

If you have a large number of annotations to go through, this can save you a lot of time. I hope this helps someone.

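It turned out that the problem was my annotations after all. This became clear when I noticed that training always crashed at the same step, but that the step changed whenever I recreated the TF record files with a new random ordering.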

The error was that some of my annotation files had incorrect max and min values.
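To find the offending annotations rather than silently skipping them, a check along the following lines might help. This is only a minimal sketch: it assumes the annotations are in a CSV with the columns filename, width, height, class, xmin, ymin, xmax, ymax, and the function name find_bad_boxes, the sample file names, and the class "brick" are made up for illustration. The out-of-bounds check is an extra assumption that is not mentioned in the thread.

```python
import csv
import io


def find_bad_boxes(rows):
    """Return (filename, reason) pairs for boxes that look invalid.

    Each row is a dict with the keys filename, width, height,
    xmin, ymin, xmax, ymax (an assumed CSV layout).
    """
    problems = []
    for row in rows:
        width, height = float(row["width"]), float(row["height"])
        xmin, xmax = float(row["xmin"]), float(row["xmax"])
        ymin, ymax = float(row["ymin"]), float(row["ymax"])
        # Min must be strictly below max, as in the filter shown earlier.
        if xmin >= xmax:
            problems.append((row["filename"], "xmin >= xmax"))
        if ymin >= ymax:
            problems.append((row["filename"], "ymin >= ymax"))
        # Extra check (assumption): the box should lie inside the image.
        if xmin < 0 or ymin < 0 or xmax > width or ymax > height:
            problems.append((row["filename"], "box outside image"))
    return problems


# Tiny in-memory example standing in for a real annotation CSV.
sample = io.StringIO(
    "filename,width,height,class,xmin,ymin,xmax,ymax\n"
    "a.jpg,100,80,brick,10,10,50,40\n"
    "b.jpg,100,80,brick,60,10,20,40\n"
    "c.jpg,100,80,brick,10,10,120,40\n"
)
for filename, reason in find_bad_boxes(csv.DictReader(sample)):
    print(filename, reason)
```

With a real file, the StringIO object would be replaced by an opened CSV file. Reporting the file name and the reason makes it possible to fix the annotation itself instead of dropping the box.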
