Memory leak after calling "tf.keras.Model.fit"; training never starts



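I am using my own YOLO implementation, which worked fine on TensorFlow versions up to and including 2.5. I recently tried to train YOLOv3 on a small dataset using tf.keras.Model.fit. Here is a Colab notebook you can use to reproduce the problem. Shortly after model.fit is called, the following messages repeat continuously: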

/usr/local/lib/python3.7/dist-packages/keras/engine/functional.py:1410: CustomMaskWarning: Custom mask layers require a config and must override get_config. When loading, the custom mask layer must be passed to the custom_objects argument.
layer_config = serialize_layer_fn(layer)

INFO:tensorflow:Assets written to: ram://eefa3127-ad7d-4445-a186-75fd8f0b81e1/assets

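Memory usage then keeps growing for no apparent reason until the runtime crashes from running out of memory. (This does not happen on earlier TensorFlow versions, <= 2.5.) You can verify this with another notebook that uses TensorFlow 2.5, where everything runs smoothly and training proceeds as expected. I also tried installing TensorFlow 2.8 instead of Colab's default version (2.7), but the problem persists.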

Here is the output with the problem (tensorflow > 2.5):

2022-02-07 05:52:00,476 yolo_tf2.utils.common.activate_gpu +325: INFO     [260] GPU activated
2022-02-07 05:52:00,477 yolo_tf2.utils.common.train +468: INFO     [260] Starting training ...
2022-02-07 05:52:04,293 yolo_tf2.utils.common.create_models +447: INFO     [260] Training and inference models created
2022-02-07 05:52:04,295 yolo_tf2.utils.common.wrapper +64: INFO     [260] create_models execution time: 3.8118433569999866 seconds
2022-02-07 05:52:04,301 yolo_tf2.utils.common.create_new_dataset +366: INFO     [260] Generating new dataset ...
2022-02-07 05:52:07,014 yolo_tf2.utils.common.adjust_non_voc_csv +184: INFO     [260] Adjustment from existing received 10107 labels containing 16 classes
2022-02-07 05:52:07,022 yolo_tf2.utils.common.adjust_non_voc_csv +187: INFO     [260] Added prefix to images: /content/yolo-data/images
Parsed labels:
Car               3153
Pedestrian        1418
Palm Tree         1379
Traffic Lights    1269
Street Sign       1109
Street Lamp        995
Road Block         363
Flag               124
Trash Can           90
Minivan             68
Fire Hydrant        52
Bus                 43
Pickup Truck        20
Bicycle             17
Delivery Truck       4
Motorcycle           3
Name: object_name, dtype: int64
2022-02-07 05:52:09,513 yolo_tf2.utils.common.save_fig +33: INFO     [260] Saved figure /content/output/plots/Relative width and height for 10107 boxes..png
/usr/local/lib/python3.7/dist-packages/yolo_tf2/utils/dataset_handlers.py:209: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
groups = np.array(data.groupby('image_path'))
Processing beverly_hills_train.tfrecord
Building example: 406/411 ... Beverly_hills184.jpg 99% completed2022-02-07 05:52:12,922 yolo_tf2.utils.common.save_tfr +227: INFO     [260] Saved training TFRecord: /content/data/tfrecords/beverly_hills_train.tfrecord
Building example: 411/411 ... Beverly_hills365.jpg 100% completed
Processing beverly_hills_test.tfrecord
Building example: 31/46 ... Beverly_hills335.jpg 67% completed2022-02-07 05:52:13,175 yolo_tf2.utils.common.save_tfr +229: INFO     [260] Saved validation TFRecord: /content/data/tfrecords/beverly_hills_test.tfrecord
2022-02-07 05:52:13,271 yolo_tf2.utils.common.read_tfr +263: INFO     [260] Read TFRecord: /content/data/tfrecords/beverly_hills_train.tfrecord
Building example: 46/46 ... Beverly_hills186.jpg 100% completed
2022-02-07 05:52:18,892 yolo_tf2.utils.common.read_tfr +263: INFO     [260] Read TFRecord: /content/data/tfrecords/beverly_hills_test.tfrecord
/usr/local/lib/python3.7/dist-packages/keras/engine/functional.py:1410: CustomMaskWarning: Custom mask layers require a config and must override get_config. When loading, the custom mask layer must be passed to the custom_objects argument.
layer_config = serialize_layer_fn(layer)
2022-02-07 05:52:50.575910: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
INFO:tensorflow:Assets written to: ram://eefa3127-ad7d-4445-a186-75fd8f0b81e1/assets
/usr/local/lib/python3.7/dist-packages/keras/engine/functional.py:1410: CustomMaskWarning: Custom mask layers require a config and must override get_config. When loading, the custom mask layer must be passed to the custom_objects argument.
layer_config = serialize_layer_fn(layer)
INFO:tensorflow:Assets written to: ram://cbe6d5a4-5322-494b-ba91-3fd34131cdd9/assets
/usr/local/lib/python3.7/dist-packages/keras/engine/functional.py:1410: CustomMaskWarning: Custom mask layers require a config and must override get_config. When loading, the custom mask layer must be passed to the custom_objects argument.
layer_config = serialize_layer_fn(layer)
INFO:tensorflow:Assets written to: ram://f15f3f25-9adb-4eb0-aa0d-83fa874bc74e/assets
/usr/local/lib/python3.7/dist-packages/keras/engine/functional.py:1410: CustomMaskWarning: Custom mask layers require a config and must override get_config. When loading, the custom mask layer must be passed to the custom_objects argument.
layer_config = serialize_layer_fn(layer)
INFO:tensorflow:Assets written to: ram://86dd6f5f-4416-4465-99c0-928fd88e8a93/assets
/usr/local/lib/python3.7/dist-packages/keras/engine/functional.py:1410: CustomMaskWarning: Custom mask layers require a config and must override get_config. When loading, the custom mask layer must be passed to the custom_objects argument.
layer_config = serialize_layer_fn(layer)
INFO:tensorflow:Assets written to: ram://ca08220f-cabc-4017-96d3-383557342388/assets
/usr/local/lib/python3.7/dist-packages/keras/engine/functional.py:1410: CustomMaskWarning: Custom mask layers require a config and must override get_config. When loading, the custom mask layer must be passed to the custom_objects argument.
layer_config = serialize_layer_fn(layer)
INFO:tensorflow:Assets written to: ram://0f634207-e822-4d6c-a805-3cfeab37532f/assets
/usr/local/lib/python3.7/dist-packages/keras/engine/functional.py:1410: CustomMaskWarning: Custom mask layers require a config and must override get_config. When loading, the custom mask layer must be passed to the custom_objects argument.
layer_config = serialize_layer_fn(layer)
INFO:tensorflow:Assets written to: ram://a971d021-3da4-402a-a004-4ae4aa67148a/assets
/usr/local/lib/python3.7/dist-packages/keras/engine/functional.py:1410: CustomMaskWarning: Custom mask layers require a config and must override get_config. When loading, the custom mask layer must be passed to the custom_objects argument.
layer_config = serialize_layer_fn(layer)
INFO:tensorflow:Assets written to: ram://31d72fdf-1ce6-4131-a7e6-f6444747e9c9/assets
/usr/local/lib/python3.7/dist-packages/keras/engine/functional.py:1410: CustomMaskWarning: Custom mask layers require a config and must override get_config. When loading, the custom mask layer must be passed to the custom_objects argument.
layer_config = serialize_layer_fn(layer)
INFO:tensorflow:Assets written to: ram://dac323b6-591a-481c-bbe6-85bb82bef38c/assets
/usr/local/lib/python3.7/dist-packages/keras/engine/functional.py:1410: CustomMaskWarning: Custom mask layers require a config and must override get_config. When loading, the custom mask layer must be passed to the custom_objects argument.
layer_config = serialize_layer_fn(layer)
INFO:tensorflow:Assets written to: ram://99b029f7-11d1-40f2-b459-fd1d8dca5ba1/assets
/usr/local/lib/python3.7/dist-packages/keras/engine/functional.py:1410: CustomMaskWarning: Custom mask layers require a config and must override get_config. When loading, the custom mask layer must be passed to the custom_objects argument.
layer_config = serialize_layer_fn(layer)
INFO:tensorflow:Assets written to: ram://210489fb-0895-4769-8be3-effd01d92695/assets
/usr/local/lib/python3.7/dist-packages/keras/engine/functional.py:1410: CustomMaskWarning: Custom mask layers require a config and must override get_config. When loading, the custom mask layer must be passed to the custom_objects argument.
layer_config = serialize_layer_fn(layer)

Here is the output without the problem (tensorflow 2.5):

2022-02-07 06:09:53.125735: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-02-07 06:09:55.370728: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2022-02-07 06:09:55,387 yolo_tf2.utils.common.train +468: INFO     [269] Starting training ...
2022-02-07 06:09:55.387211: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-02-07 06:09:55.387252: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (de0312867ce7): /proc/driver/nvidia/version does not exist
2022-02-07 06:09:55.427963: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-02-07 06:10:00,078 yolo_tf2.utils.common.create_models +447: INFO     [269] Training and inference models created
2022-02-07 06:10:00,080 yolo_tf2.utils.common.wrapper +64: INFO     [269] create_models execution time: 4.689235652999997 seconds
2022-02-07 06:10:00,081 yolo_tf2.utils.common.create_new_dataset +366: INFO     [269] Generating new dataset ...
2022-02-07 06:10:02,572 yolo_tf2.utils.common.adjust_non_voc_csv +184: INFO     [269] Adjustment from existing received 10107 labels containing 16 classes
2022-02-07 06:10:02,574 yolo_tf2.utils.common.adjust_non_voc_csv +187: INFO     [269] Added prefix to images: /content/yolo-data/images
Parsed labels:
Car               3153
Pedestrian        1418
Palm Tree         1379
Traffic Lights    1269
Street Sign       1109
Street Lamp        995
Road Block         363
Flag               124
Trash Can           90
Minivan             68
Fire Hydrant        52
Bus                 43
Pickup Truck        20
Bicycle             17
Delivery Truck       4
Motorcycle           3
Name: object_name, dtype: int64
2022-02-07 06:10:04,900 yolo_tf2.utils.common.save_fig +33: INFO     [269] Saved figure /content/output/plots/Relative width and height for 10107 boxes..png
/usr/local/lib/python3.7/dist-packages/yolo_tf2/utils/dataset_handlers.py:209: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
groups = np.array(data.groupby('image_path'))
Processing beverly_hills_train.tfrecord
Building example: 392/411 ... Beverly_hills294.jpg 95% completed2022-02-07 06:10:10,341 yolo_tf2.utils.common.save_tfr +227: INFO     [269] Saved training TFRecord: /content/data/tfrecords/beverly_hills_train.tfrecord
Building example: 411/411 ... Beverly_hills94.jpg 100% completed
Processing beverly_hills_test.tfrecord
Building example: 25/46 ... Beverly_hills334.jpg 54% completed2022-02-07 06:10:10,730 yolo_tf2.utils.common.save_tfr +229: INFO     [269] Saved validation TFRecord: /content/data/tfrecords/beverly_hills_test.tfrecord
Building example: 46/46 ... Beverly_hills251.jpg 100% completed
2022-02-07 06:10:10,843 yolo_tf2.utils.common.read_tfr +263: INFO     [269] Read TFRecord: /content/data/tfrecords/beverly_hills_train.tfrecord
2022-02-07 06:10:15,264 yolo_tf2.utils.common.read_tfr +263: INFO     [269] Read TFRecord: /content/data/tfrecords/beverly_hills_test.tfrecord
2022-02-07 06:10:15.676352: I tensorflow/core/profiler/lib/profiler_session.cc:126] Profiler session initializing.
2022-02-07 06:10:15.676423: I tensorflow/core/profiler/lib/profiler_session.cc:141] Profiler session started.
2022-02-07 06:10:15.701051: I tensorflow/core/profiler/lib/profiler_session.cc:159] Profiler session tear down.
/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/utils/generic_utils.py:497: CustomMaskWarning: Custom mask layers require a config and must override get_config. When loading, the custom mask layer must be passed to the custom_objects argument.
category=CustomMaskWarning)
2022-02-07 06:10:17.064324: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2022-02-07 06:10:17.081408: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2199995000 Hz
Epoch 1/100
1/Unknown - 40s 40s/step - loss: 7333.2617 - layer_205_lambda_loss: 403.8862 - layer_230_lambda_loss: 1509.9465 - layer_255_lambda_loss: 5407.68902022-02-07 06:10:59.974130: I tensorflow/core/profiler/lib/profiler_session.cc:126] Profiler session initializing.
2022-02-07 06:10:59.974196: I tensorflow/core/profiler/lib/profiler_session.cc:141] Profiler session started.
2/Unknown - 50s 11s/step - loss: 7819.7124 - layer_205_lambda_loss: 697.4546 - layer_230_lambda_loss: 1647.7856 - layer_255_lambda_loss: 5462.71582022-02-07 06:11:10.059899: I tensorflow/core/profiler/lib/profiler_session.cc:66] Profiler session collecting data.
2022-02-07 06:11:10.088821: I tensorflow/core/profiler/lib/profiler_session.cc:159] Profiler session tear down.
2022-02-07 06:11:10.133747: I tensorflow/core/profiler/rpc/client/save_profile.cc:137] Creating directory: /content/data/tfrecords/train/plugins/profile/2022_02_07_06_11_10
2022-02-07 06:11:10.157875: I tensorflow/core/profiler/rpc/client/save_profile.cc:143] Dumped gzipped tool data for trace.json.gz to /content/data/tfrecords/train/plugins/profile/2022_02_07_06_11_10/de0312867ce7.trace.json.gz
2022-02-07 06:11:10.189438: I tensorflow/core/profiler/rpc/client/save_profile.cc:137] Creating directory: /content/data/tfrecords/train/plugins/profile/2022_02_07_06_11_10
2022-02-07 06:11:10.189678: I tensorflow/core/profiler/rpc/client/save_profile.cc:143] Dumped gzipped tool data for memory_profile.json.gz to /content/data/tfrecords/train/plugins/profile/2022_02_07_06_11_10/de0312867ce7.memory_profile.json.gz
2022-02-07 06:11:10.192796: I tensorflow/core/profiler/rpc/client/capture_profile.cc:251] Creating directory: /content/data/tfrecords/train/plugins/profile/2022_02_07_06_11_10Dumped tool data for xplane.pb to /content/data/tfrecords/train/plugins/profile/2022_02_07_06_11_10/de0312867ce7.xplane.pb
Dumped tool data for overview_page.pb to /content/data/tfrecords/train/plugins/profile/2022_02_07_06_11_10/de0312867ce7.overview_page.pb
Dumped tool data for input_pipeline.pb to /content/data/tfrecords/train/plugins/profile/2022_02_07_06_11_10/de0312867ce7.input_pipeline.pb
Dumped tool data for tensorflow_stats.pb to /content/data/tfrecords/train/plugins/profile/2022_02_07_06_11_10/de0312867ce7.tensorflow_stats.pb
Dumped tool data for kernel_stats.pb to /content/data/tfrecords/train/plugins/profile/2022_02_07_06_11_10/de0312867ce7.kernel_stats.pb
15/Unknown - 181s 10s/step - loss: 3493.4009 - layer_205_lambda_loss: 232.7220 - layer_230_lambda_loss: 629.6332 - layer_255_lambda_loss: 2618.9722

I also tried the following (same result):

  • python versions: 3.8, 3.9, 3.10
  • Ubuntu 18 and OSX
  • tensorflow versions: 2.7, 2.8, 2.9.0-dev20220203
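
For anyone trying to reproduce this across the combinations listed above, a small helper can record exactly which versions a given run used. This is only a sketch: the function name environment_report and the default package list are made up for illustration, and importlib.metadata requires Python 3.8 or newer.

```python
import platform
from importlib import metadata


def environment_report(packages=("tensorflow", "keras", "numpy")):
    """Collect interpreter, OS and package versions for a bug report.

    The package names are illustrative defaults, not a fixed requirement.
    """
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            # Reads the installed distribution's version without importing it.
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Pasting this output next to each log makes it easier to tell which TensorFlow and Python versions a given trace came from.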

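I can confirm that the same error still occurs on TensorFlow 2.8, even with the latest nightly build (2.8.0-dev20211222). It also occurs with several other tf.keras models (some of which ran fine on TensorFlow 2.2 but stop training after a few epochs on 2.8), even with very small datasets. Reducing the batch size helps slightly by delaying the failure to a later epoch, but it does not solve the problem.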
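
To put numbers on the growth described above, memory can be sampled once per epoch. The following is a minimal sketch rather than part of the original code: MemoryGrowthTracker and peak_rss_bytes are hypothetical names, the default probe measures peak host memory of the process through the Unix-only resource module, and any other callable returning a byte count can be swapped in.

```python
import resource
import sys


def peak_rss_bytes():
    """Peak resident set size of this process, in bytes (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    return peak if sys.platform == "darwin" else peak * 1024


class MemoryGrowthTracker:
    """Records one memory sample per epoch and reports the average growth."""

    def __init__(self, probe=peak_rss_bytes):
        self.probe = probe    # callable returning a byte count
        self.samples = []     # list of (epoch, bytes) pairs

    def on_epoch_end(self, epoch, logs=None):
        # Same (epoch, logs) signature that Keras passes to epoch-end hooks.
        self.samples.append((epoch, self.probe()))

    def growth_per_epoch(self):
        """Average bytes gained per epoch between first and last sample."""
        if len(self.samples) < 2:
            return 0.0
        (first_epoch, first), (last_epoch, last) = self.samples[0], self.samples[-1]
        return (last - first) / max(last_epoch - first_epoch, 1)
```

One way to hook this into training would be tf.keras.callbacks.LambdaCallback(on_epoch_end=tracker.on_epoch_end) passed to model.fit through the callbacks argument. For GPU memory, a probe such as lambda: tf.config.experimental.get_memory_info("GPU:0")["current"] may work on TensorFlow 2.5 or newer; the device name "GPU:0" is an assumption.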

I have already tried setting os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async", but to no avail.
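
For reference, this is roughly how the variable can be set so that it is in place before TensorFlow initializes its GPU allocator. It is a sketch under the assumption that the safest point is before TensorFlow is imported at all; select_gpu_allocator is a hypothetical helper, not a TensorFlow API.

```python
import os
import sys


def select_gpu_allocator(name="cuda_malloc_async"):
    """Set TF_GPU_ALLOCATOR for this process.

    Returns False if TensorFlow had already been imported when this was
    called, in which case the setting may come too late to take effect.
    """
    already_imported = "tensorflow" in sys.modules
    os.environ["TF_GPU_ALLOCATOR"] = name
    return not already_imported


# Call this at the very top of the training script, before importing TensorFlow.
set_in_time = select_gpu_allocator()
```

If set_in_time is False, the variable was set after TensorFlow had been loaded, which is one possible reason for seeing no change.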

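Below is a sample of the error output during training (memory usage keeps increasing throughout training until the out-of-memory failure at epoch 60).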

2022-03-11 16:33:20.651586: I tensorflow/core/common_runtime/gpu/gpu_device.cc:11525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 11433 MB memory: -> device: 0, name: NVIDIA TITAN X (Pascal), pci bus id: 0000:02:00.00, compute capability: 6.1
2022-03-11 16:33:35.198049: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8101
2022-03-11 16:42:26.89101: W tensorflow/core/common_runtime/bfc_allocator.cc:462] Allocator (GPU_0_bfc) ran out of memory trying to allocate 98.00MiB (rounded to 102760448)requested by op gradient_tape/model_2/conv_3_2/Conv2D/Conv2DBackpropInput
If the cause is memory fragmentation maybe the environment variable "TF_GPU_ALLOCATOR=cuda_malloc_async" will improve the situation.
Current allocation summary follows.
Current allocation summary follows.
2022-03-11 16:42:26.8891975: I tensorflow/core/common_runtime/bfc_allocator.cc:1010] BFCAllocator dump for GPU_0_bfc
2022-03-11 16:42:26.892028: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (256): Total Chunks: 179, Chunks in use: 173. 44.8KiB allocated for chunks. 43.2KiB in use in bin. 16.7KiB client-requested in use in bin.
2022-03-11 16:42:26.892060: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (512): Total Chunks: 58, Chunks in use: 52. 29.2KiB allocated for chunks. 26.2KiB in use in bin. 26.0KiB client-requested in use in bin.
2022-03-11 16:42:26.892093: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (1024): Total Chunks: 203, Chunks in use: 202. 212.5KiB allocated for chunks. 211.5KiB in use in bin. 202.0KiB client-requested in use in bin.
2022-03-11 16:42:26.892639: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (2048): Total Chunks: 1, Chunks in use: 0. 2.0KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-03-11 16:42:26.892864: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (4096): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-03-11 16:42:26.8892915: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (8192): Total Chunks: 22, Chunks in use: 22. 179.0KiB allocated for chunks. 179.0KiB in use in bin. 176.0KiB client-requested in use in bin.
2022-03-11 16:42:26.8892954: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (16384): Total Chunks: 21, Chunks in use: 18. 518.8KiB allocated for chunks. 446.2KiB in use in bin. 442.0KiB client-requested in use in bin.
2022-03-11 16:42:26.892989: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (32768): Total Chunks: 4, Chunks in use: 3. 184.0KiB allocated for chunks. 134.0KiB in use in bin. 81.0KiB client-requested in use in bin.
2022-03-11 16:42:26.893018: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (65536): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-03-11 16:42:26.893050: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (131072): Total Chunks: 6, Chunks in use: 5. 948.5KiB allocated for chunks. 804.5KiB in use in bin. 720.0KiB client-requested in use in bin.
2022-03-11 16:42:26.8930079: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (262144): Total Chunks: 5, Chunks in use: 5. 1.58MiB allocated for chunks. 1.58MiB in use in bin. 1.41MiB client-requested in use in bin.
2022-03-11 16:42:26.893297: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (524288): Total Chunks: 18, Chunks in use: 14. 10.70MiB allocated for chunks. 8.12MiB in use in bin. 7.88MiB client-requested in use in bin.
2022-03-11 16:42:26.893332: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (1048576): Total Chunks: 8, Chunks in use: 5. 9.51MiB allocated for chunks. 6.06MiB in use in bin. 5.65MiB client-requested in use in bin.
2022-03-11 16:42:26.893367: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (2097152): Total Chunks: 19, Chunks in use: 19. 44.25MiB allocated for chunks. 44.25MiB in use in bin. 42.75MiB client-requested in use in bin.
2022-03-11 16:42:26.893402: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (4194304): Total Chunks: 12, Chunks in use: 10. 68.09MiB allocated for chunks. 54.09MiB in use in bin. 53.22MiB client-requested in use in bin.
2022-03-11 16:42:26.893431: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (8388608): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-03-11 16:42:26.893464: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (16777216): Total Chunks: 15, Chunks in use: 14. 3600.7MiB allocated for chunks. 335.83MiB in use in bin. 311.38MiB client-requested in use in bin.
2022-03-11 16:42:26.893497: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (33554432): Total Chunks: 13, Chunks in use: 12. 615.82MiB allocated for chunks. 578.18MiB in use in bin. 563.50MiB client-requested in use in bin.
2022-03-11 16:42:26.8893709: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (67108864): Total Chunks: 78, Chunks in use: 78. 6.23GiB allocated for chunks. 6.23GiB in use in bin. 5.94GiB client-requested in use in bin.
2022-03-11 16:42:26.893742: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (134217728): Total Chunks: 7, Chunks in use: 7. 1.29GiB allocated for chunks. 1.29GiB in use in bin. 1.24GiB client-requested in use in bin.
2022-03-11 16:42:26.8893774: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (268435456): Total Chunks: 6, Chunks in use: 6. 2.56GiB allocated for chunks. 2.56GiB in use in bin. 2.45GiB client-requested in use in bin.
2022-03-11 16:42:26.895086: I tensorflow/core/common_runtime/bfc_allocator.cc:1033] Bin for 98.00MiB was 64.00MiB, Chunk State:
2022-03-11 16:42:26.895129: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] Next region of size 11988434944
2022-03-11 16:42:26.895161: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] at 7fd8cc000000 of size 1280 next 1
2022-03-11 16:42:26.895380: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] at 7fd8cc000500 of size 256 next 2
2022-03-11 16:42:26.895411: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] at 7fd8cc000600 of size 256 next 3
2022-03-11 16:42:26.895433: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] at 7fd8cc000700 of size 256 next 4
2022-03-11 16:42:26.895620: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] at 7fd8cc000800 of size 256 next 5
2022-03-11 16:42:26.895656: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] at 7fd8cc000900 of size 256 next 6
2022-03-11 16:42:26.895680: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] at 7fd8cc000a00 of size 256 next 7
2022-03-11 16:42:26.8895701: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] at 7fd8cc000b00 of size 256 next 8
2022-03-11 16:42:26.895721: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] at 7fd8cc000c00 of size 256 next 11
2022-03-11 16:42:26.895742: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] at ... tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 77690368 totalling 74.09MiB
2022-03-11 16:42:26.914528: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 82451456 totalling 78.63MiB
2022-03-11 16:42:26.914551: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 82456576 totalling 78.64MiB
2022-03-11 16:42:26.914573: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 87380992 totalling 83.33MiB
2022-03-11 16:42:26.914594: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 89632256 totalling 85.48MiB
2022-03-11 16:42:26.914617: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 90040832 totalling 85.87MiB
2022-03-11 16:42:26.914639: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 95168768 totalling 90.76MiB
2022-03-11 16:42:26.914660: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 96234496 totalling 91.78MiB
2022-03-11 16:42:26.914682: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 99255296 totalling 94.66MiB
2022-03-11 16:42:26.914705: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 99264512 totalling 94.67MiB
2022-03-11 16:42:26.914726: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 13 Chunks of size 102760448 totalling 1.24GiB
2022-03-11 16:42:26.914747: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 104173312 totalling 99.35MiB
2022-03-11 16:42:26.914770: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 105158656 totalling 100.29MiB
2022-03-11 16:42:26.914792: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 108818432 totalling 103.78MiB
2022-03-11 16:42:26.9914812: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 110231552 totalling 105.12MiB
2022-03-11 16:42:26.914832: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 119567360 totalling 114.03MiB
2022-03-11 16:42:26.914877: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 127390208 totalling 121.49MiB
2022-03-11 16:42:26.914900: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 154140672 totalling 147.00MiB
2022-03-11 16:42:26.914921: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 6 Chunks of size 205520896 totalling 1.15GiB
2022-03-11 16:42:26.914943: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 5 Chunks of size 411041792 totalling 1.91GiB
2022-03-11 16:42:26.914965: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 692452608 totalling 660.37MiB
2022-03-11 16:42:26.914986: I tensorflow/core/common_runtime/bfc_allocator.cc:1078] Sum Total of in-use chunks: 11.08GiB
2022-03-11 16:42:26.915006: I tensorflow/core/common_runtime/bfc_allocator.cc:1080] total_region_allocated_bytes_: 1988434944 memory_limit_: 1988434941 available bytes: 0 curr_region_allocation_bytes_: 23976869888
2022-03-11 16:42:26.915035: I tensorflow/core/common_runtime/bfc_allocator.cc:1086] Stats:
Limit: 11988434944
InUse: 11902258688
MaxInUse: 11902527488
NumAllocs: 1785814
MaxAllocSize: 1866465280
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0

2022-03-11 16:42:26.915131: W tensorflow/core/common_runtime/bfc_allocator.cc:474] ***************************************************************************************************x
2022-03-11 16:42:26.9916684: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at conv_grad_input_ops.cc:408 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[128256,28] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
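
As a sanity check on the allocator statistics above, the byte counts can be converted into more readable units. The snippet below is only a worked-arithmetic sketch: the Limit and InUse values and the 102760448-byte request are copied from the log, while the summarize helper is a made-up name. Note that "free" here is simply Limit minus InUse and says nothing about whether that space is contiguous.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

# Values copied from the allocator dump above.
LIMIT_BYTES = 11988434944
IN_USE_BYTES = 11902258688
REQUEST_BYTES = 102760448  # the allocation that failed, after rounding


def summarize(limit, in_use, request):
    """Convert raw allocator byte counts into readable figures."""
    free = limit - in_use
    return {
        "limit_gib": limit / GIB,
        "in_use_gib": in_use / GIB,
        "utilization": in_use / limit,
        "free_mib": free / MIB,
        "request_mib": request / MIB,
        "fits": request <= free,
    }


summary = summarize(LIMIT_BYTES, IN_USE_BYTES, REQUEST_BYTES)
for key, value in summary.items():
    print(f"{key}: {value}")
```

With these figures, more than 99% of the limit is in use, roughly 82 MiB is left over, and the request is exactly 98 MiB, so the allocation cannot fit regardless of fragmentation.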

Are there any clues on how to prevent this, and has this bug already been fixed, or is a fix in progress?

I am happy to provide more details if needed. Thanks.

Latest update