Why does my TensorFlow training pause indefinitely without any error?



I set up TensorFlow, verified that GPU acceleration works, and set up the configuration, all following this tutorial: https://github.com/nicknochnack/TFODCourse.

I then run:

py Tensorflow\models\research\object_detection\model_main_tf2.py --model_dir=Tensorflow\workspace\models\my_ssd_mobnet --pipeline_config_path=Tensorflow\workspace\models\my_ssd_mobnet\pipeline.config --num_train_steps=100

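I get the output logs below, then wait for over an hour. Python keeps using 25-26% of my CPU but never prints any progress logs. Even after lowering the number of steps to 100, I get nothing: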

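There are a lot of warnings, but maybe that's normal? I googled some of the INFO logs and found that they are harmless. Judging from these logs, am I missing something or doing something wrong? Here is a summary of the logs with the future deprecation warnings removed: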

2021-07-11 02:25:42.869766: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
py Tensorflow\models\research\object_detection\model_main_tf2.py --model_dir=Tensorflow\workspace\models\my_ssd_mobnet --pipeline_config_path=Tensorflow\workspace\models\my_ssd_mobnet\pipeline.config --num_train_steps=100
2021-07-11 02:25:44.989884: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
2021-07-11 02:25:47.588384: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library nvcuda.dll
2021-07-11 02:25:47.605286: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 2080 SUPER computeCapability: 7.5
coreClock: 1.845GHz coreCount: 48 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 462.00GiB/s
2021-07-11 02:25:47.605366: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
2021-07-11 02:25:47.610303: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublas64_11.dll
2021-07-11 02:25:47.610390: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublasLt64_11.dll
2021-07-11 02:25:47.613585: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cufft64_10.dll
2021-07-11 02:25:47.614873: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library curand64_10.dll
2021-07-11 02:25:47.621607: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cusolver64_11.dll
2021-07-11 02:25:47.623967: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cusparse64_11.dll
2021-07-11 02:25:47.624496: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudnn64_8.dll
2021-07-11 02:25:47.626311: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-07-11 02:25:47.626728: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-07-11 02:25:47.627707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 2080 SUPER computeCapability: 7.5
coreClock: 1.845GHz coreCount: 48 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 462.00GiB/s
2021-07-11 02:25:47.627810: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-07-11 02:25:48.067610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-07-11 02:25:48.067778: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      0
2021-07-11 02:25:48.068662: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0:   N
2021-07-11 02:25:48.069323: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5957 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 2080 SUPER, pci bus id: 0000:01:00.0, compute capability: 7.5)
WARNING:tensorflow:Collective ops is not configured at program startup. Some performance features may not be enabled.
W0711 02:25:48.071784 10384 mirrored_strategy.py:379] Collective ops is not configured at program startup. Some performance features may not be enabled.
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0711 02:25:48.225363 10384 mirrored_strategy.py:369] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
INFO:tensorflow:Maybe overwriting train_steps: 100
I0711 02:25:48.229352 10384 config_util.py:552] Maybe overwriting train_steps: 100
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I0711 02:25:48.230349 10384 config_util.py:552] Maybe overwriting use_bfloat16: False
INFO:tensorflow:Reading unweighted datasets: ['Tensorflow\workspace\annotations\train.record']
I0711 02:25:48.308165 10384 dataset_builder.py:163] Reading unweighted datasets: ['Tensorflow\workspace\annotations\train.record']
INFO:tensorflow:Reading record datasets for input file: ['Tensorflow\workspace\annotations\train.record']
I0711 02:25:48.309138 10384 dataset_builder.py:80] Reading record datasets for input file: ['Tensorflow\workspace\annotations\train.record']
INFO:tensorflow:Number of filenames to read: 1
I0711 02:25:48.311132 10384 dataset_builder.py:81] Number of filenames to read: 1
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
W0711 02:25:48.311132 10384 dataset_builder.py:87] num_readers has been reduced to 1 to match input file shards.
tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)

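The full logs, including the future deprecation warnings, are in this gist. The only difference is the future deprecation warnings themselves, which shouldn't break anything.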

I just don't know how to debug this. It looks like it is working, and then it hangs.

I found a solution in a GitHub issue where other people had the same problem:

https://github.com/tensorflow/models/issues/9581

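The problem was that my TFRecord generation script could not find any images and created empty record files. Unfortunately, both the generation script and TensorFlow fail silently in this case.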
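A quick way to catch this before starting a training run is to count the records in the file. The following is a minimal sketch, not part of the tutorial: the helper name `count_tfrecord_records` is made up for illustration, and it assumes an uncompressed TFRecord file with the standard framing (8-byte little-endian length, 4-byte length CRC, payload, 4-byte payload CRC). It does not verify the CRCs and does not need TensorFlow to be importable.

```python
import os
import struct


def count_tfrecord_records(path):
    """Count the records in an uncompressed TFRecord file.

    Hypothetical helper for illustration. Walks the record framing only;
    CRC fields are skipped, not verified.
    """
    size = os.path.getsize(path)
    count = 0
    with open(path, "rb") as f:
        while True:
            # Each record starts with an 8-byte length and a 4-byte CRC of it.
            header = f.read(12)
            if not header:
                break  # clean end of file
            if len(header) < 12:
                raise ValueError(f"{path}: truncated record header")
            (length,) = struct.unpack("<Q", header[:8])
            # Skip the payload and the 4-byte CRC that follows it.
            end = f.tell() + length + 4
            if end > size:
                raise ValueError(f"{path}: truncated record payload")
            f.seek(end)
            count += 1
    return count
```

Calling `count_tfrecord_records(r"Tensorflow\workspace\annotations\train.record")` on the file named in the logs above should return a number greater than zero; a result of 0 means the generation step wrote an empty file. With TensorFlow available, iterating over `tf.data.TFRecordDataset(path)` and counting the elements should give the same answer.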
