TensorFlow 2.1: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize



I am using Anaconda Python 3.7 with TensorFlow 2.1, CUDA 10.1, and cuDNN 7.6.5, and I am trying to run RetinaNet (https://github.com/fizyr/keras-retinanet):

python keras_retinanet/bin/train.py --freeze-backbone --random-transform --batch-size 8 --steps 500 --epochs 10 csv annotations.csv classes.csv

This produces the following error:

Epoch 1/10
2020-02-10 20:34:37.807590: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-02-10 20:34:38.835777: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-02-10 20:34:39.753051: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-02-10 20:34:39.776706: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node conv1/convolution}}]]
Traceback (most recent call last):
  File "keras_retinanet/bin/train.py", line 530, in <module>
    main()
  File "keras_retinanet/bin/train.py", line 525, in main
    initial_epoch=args.initial_epoch
  File "C:\Anaconda\Anaconda3.7\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "C:\Anaconda\Anaconda3.7\lib\site-packages\keras\engine\training.py", line 1732, in fit_generator
    initial_epoch=initial_epoch)
  File "C:\Anaconda\Anaconda3.7\lib\site-packages\keras\engine\training_generator.py", line 220, in fit_generator
    reset_metrics=False)
  File "C:\Anaconda\Anaconda3.7\lib\site-packages\keras\engine\training.py", line 1514, in train_on_batch
    outputs = self.train_function(ins)
  File "C:\Anaconda\Anaconda3.7\lib\site-packages\tensorflow_core\python\keras\backend.py", line 3727, in __call__
    outputs = self._graph_fn(*converted_inputs)
  File "C:\Anaconda\Anaconda3.7\lib\site-packages\tensorflow_core\python\eager\function.py", line 1551, in __call__
    return self._call_impl(args, kwargs)
  File "C:\Anaconda\Anaconda3.7\lib\site-packages\tensorflow_core\python\eager\function.py", line 1591, in _call_impl
    return self._call_flat(args, self.captured_inputs, cancellation_manager)
  File "C:\Anaconda\Anaconda3.7\lib\site-packages\tensorflow_core\python\eager\function.py", line 1692, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "C:\Anaconda\Anaconda3.7\lib\site-packages\tensorflow_core\python\eager\function.py", line 545, in call
    ctx=ctx)
  File "C:\Anaconda\Anaconda3.7\lib\site-packages\tensorflow_core\python\eager\execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node conv1/convolution (defined at C:\Anaconda\Anaconda3.7\lib\site-packages\keras\backend\tensorflow_backend.py:3009) ]] [Op:__inference_keras_scratch_graph_12376]
Function call stack:
keras_scratch_graph

Has anyone run into a similar problem?

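I ran into the same error when trying to train my CNN model on two GPUs with tf.distribute.MirroredStrategy(). I have now found a workaround that lets me use both GPUs at once (training on a single GPU already worked fine). Try putting the following at the start of your application: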

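import tensorflow as tf

# Let TensorFlow grow its GPU memory on demand instead of reserving it all up front
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.compat.v1.InteractiveSession(config=config)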

Hope this helps!
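If you would rather not edit a third-party script such as keras_retinanet/bin/train.py, the same allow-growth behavior can, as far as I know, also be requested through the TF_FORCE_GPU_ALLOW_GROWTH environment variable. A minimal sketch, assuming it runs before TensorFlow is imported anywhere in the process (the helper name request_gpu_memory_growth is made up for illustration):

```python
import os


def request_gpu_memory_growth() -> None:
    """Ask TensorFlow to grow GPU memory on demand (hypothetical helper name).

    Assumption: this is called before TensorFlow initializes the GPU,
    which is easiest to guarantee by calling it before `import tensorflow`.
    """
    os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"


request_gpu_memory_growth()

# Import TensorFlow only after the variable has been set:
# import tensorflow as tf
```

The variable can equally be set in the shell before launching Python (for example with `set TF_FORCE_GPU_ALLOW_GROWTH=true` in a Windows command prompt), which leaves the training script untouched.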

Do this:

import tensorflow as tf

physical_devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)
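Note that the two lines above configure only the first GPU. On a multi-GPU machine, such as the two-GPU MirroredStrategy setup mentioned in the other answer, it may be necessary to enable memory growth on every device. A minimal sketch, not a definitive implementation: the helper name and its optional tf_module argument are hypothetical, and the argument exists only so the helper can be exercised without a GPU.

```python
def enable_memory_growth_on_all_gpus(tf_module=None):
    """Enable memory growth on every visible GPU and return how many were configured.

    tf_module is a hypothetical hook for passing in an already imported
    TensorFlow module (or a stand-in); by default TensorFlow is imported here.
    """
    if tf_module is None:
        import tensorflow as tf_module  # lazy import

    gpus = tf_module.config.experimental.list_physical_devices('GPU')
    configured = 0
    for gpu in gpus:
        try:
            tf_module.config.experimental.set_memory_growth(gpu, True)
            configured += 1
        except RuntimeError as error:
            # Memory growth must be set before the GPUs have been initialized
            print(error)
    return configured
```

Call enable_memory_growth_on_all_gpus() at the very start of the program, before any model or tensor is created.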

According to a comment on a TensorFlow GitHub issue, this error can be caused by hitting the GPU's memory limit (you can check GPU usage with the nvidia-smi or gpustat command).
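To check from Python whether the GPU is already close to full before training starts, the nvidia-smi query output can be parsed. A rough sketch, assuming nvidia-smi is on the PATH; the function names parse_gpu_memory and query_gpu_memory are made up for illustration:

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse lines of 'used, total' (MiB, one GPU per line) into a list of tuples."""
    usage = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        usage.append((used, total))
    return usage


def query_gpu_memory():
    """Run nvidia-smi and return (used_mib, total_mib) for each GPU."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory(result.stdout)


# Example usage on a machine with an NVIDIA driver installed:
# for index, (used, total) in enumerate(query_gpu_memory()):
#     print(f"GPU {index}: {used} / {total} MiB in use")
```

If most of the memory is already in use by another process before training starts, that is a likely reason for cuDNN failing to create its handle.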

If enabling memory growth with tf.config.experimental.set_memory_growth(..., True) does not work, manually limiting GPU memory usage may:

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only allocate 2 GB (1024 MB * 2) of memory on the first GPU
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024 * 2)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)

Thanks to 布莱恩博-曹 for the comment.

I hit the same error with Python 3.7.9, TensorFlow 2.1.0, CUDA 10.1.105, and cuDNN 7.6.5. It was resolved after updating the GPU driver from NVIDIA.
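To judge whether the installed driver might be too old, its version can be compared against the minimum that the CUDA toolkit requires. The sketch below is only illustrative: the default minimum of 418.96 is my reading of NVIDIA's release notes for CUDA 10.1 on Windows and should be verified there, and the helper names are hypothetical.

```python
def parse_version(text):
    """Turn a dotted version string such as '441.22' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_is_new_enough(installed, minimum="418.96"):
    """Return True if the installed driver version is at least the minimum.

    Assumption: 418.96 is the minimum Windows driver for CUDA 10.1;
    check NVIDIA's CUDA release notes before relying on it.
    """
    return parse_version(installed) >= parse_version(minimum)
```

The installed version can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and passed to driver_is_new_enough as a string.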
