My computer specs are: Windows 10, CUDA 11.2, cuDNN 8.0.5, Nvidia GeForce RTX 3080.
I followed this guide (https://github.com/armaanpriyadarshan/Training-a-Custom-TensorFlow-2.x-Object-Detector) to set up Faster R-CNN. When I train the network, I get this error:
2021-01-24 18:12:47.713443: E tensorflow/stream_executor/cuda/cuda_dnn.cc:336] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2021-01-24 18:12:47.715010: E tensorflow/stream_executor/cuda/cuda_dnn.cc:340] Error retrieving driver version: Unimplemented: kernel reported driver version not implemented on Windows
2021-01-24 18:12:47.718097: E tensorflow/stream_executor/cuda/cuda_dnn.cc:336] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2021-01-24 18:12:47.719553: E tensorflow/stream_executor/cuda/cuda_dnn.cc:340] Error retrieving driver version: Unimplemented: kernel reported driver version not implemented on Windows
Traceback (most recent call last):
  File "model_main_tf2.py", line 113, in <module>
    tf.compat.v1.app.run()
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\platform\app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\absl\app.py", line 300, in run
    _run_main(main, args)
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\absl\app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "model_main_tf2.py", line 104, in main
    model_lib_v2.train_loop(
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\object_detection\model_lib_v2.py", line 561, in train_loop
    load_fine_tune_checkpoint(detection_model,
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\object_detection\model_lib_v2.py", line 361, in load_fine_tune_checkpoint
    strategy.run(
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\distribute\distribute_lib.py", line 1259, in run
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\distribute\distribute_lib.py", line 2730, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\distribute\mirrored_strategy.py", line 628, in _call_for_each_replica
    return mirrored_run.call_for_each_replica(
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\distribute\mirrored_run.py", line 75, in call_for_each_replica
    return wrapped(args, kwargs)
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\eager\def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\eager\def_function.py", line 888, in _call
    return self._stateless_fn(*args, **kwds)
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\eager\function.py", line 2942, in __call__
    return graph_function._call_flat(
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\eager\function.py", line 1918, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\eager\function.py", line 555, in call
    outputs = execute.execute(
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\eager\execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node model/conv1_conv/Conv2D (defined at site-packages\object_detection\meta_architectures\faster_rcnn_meta_arch.py:1346) ]]
     [[Loss/RPNLoss/BalancedPositiveNegativeSampler/Cast_8/_192]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node model/conv1_conv/Conv2D (defined at site-packages\object_detection\meta_architectures\faster_rcnn_meta_arch.py:1346) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference__dummy_computation_fn_16411]
Errors may have originated from an input operation.
Input Source operations connected to node model/conv1_conv/Conv2D:
 model/lambda/Pad (defined at site-packages\object_detection\models\keras_models\resnet_v1.py:49)
Input Source operations connected to node model/conv1_conv/Conv2D:
 model/lambda/Pad (defined at site-packages\object_detection\models\keras_models\resnet_v1.py:49)
Function call stack:
_dummy_computation_fn -> _dummy_computation_fn
How can I solve this problem?
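Could you share your TensorFlow version? I think TensorFlow <= 2.4 does not support CUDA versions higher than 10.1, and that may be what is causing the problem.
Edit:
It looks like you do have TensorFlow 2.4, so I suggest downgrading CUDA to 10.1 and TensorFlow to 2.3, which is what the repository author recommends. Or, if you want to stay on TensorFlow 2.4, you should still downgrade CUDA to 11.0, because TensorFlow does not support CUDA 11.2 yet.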
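To check which combination you actually have before downgrading anything, something like the sketch below may help. The table holds only the two pairings discussed in this thread, taken from the tested build configurations in the TensorFlow install docs. The helper names (`TESTED_BUILDS`, `cuda_matches`, `report_tensorflow_build`) are made up for this example and are not part of TensorFlow. `tf.sysconfig.get_build_info()` should be available from TensorFlow 2.3 onward, but the exact keys and the format of the values can differ between platforms, so treat the printed output as a hint rather than a definitive answer.

```python
# Hypothetical helper: compare an installed CUDA toolkit version against the
# CUDA/cuDNN versions the official TensorFlow builds were tested with.
# Only the two releases mentioned in this thread are listed.
TESTED_BUILDS = {
    "2.3": {"cuda": "10.1", "cudnn": "7.6"},
    "2.4": {"cuda": "11.0", "cudnn": "8.0"},
}


def cuda_matches(tf_version, cuda_version):
    """Return True/False if the pairing is known, or None if the
    TensorFlow release is not in the small table above."""
    major_minor = ".".join(tf_version.split(".")[:2])
    expected = TESTED_BUILDS.get(major_minor)
    if expected is None:
        return None
    return cuda_version.startswith(expected["cuda"])


def report_tensorflow_build():
    """Return (version, build_info) for the installed TensorFlow,
    or None if TensorFlow cannot be imported in this environment."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # get_build_info() returns a dict; on GPU builds it usually contains
    # entries such as "cuda_version" and "cudnn_version".
    return tf.__version__, dict(tf.sysconfig.get_build_info())


if __name__ == "__main__":
    # The setup from the question: TensorFlow 2.4 with CUDA 11.2.
    print("TF 2.4 with CUDA 11.2 matches tested build:",
          cuda_matches("2.4.0", "11.2"))
    info = report_tensorflow_build()
    if info is None:
        print("TensorFlow is not installed in this environment.")
    else:
        version, build = info
        print("TensorFlow", version)
        print("Built against CUDA:", build.get("cuda_version"))
        print("Built against cuDNN:", build.get("cudnn_version"))
```

If the CUDA version TensorFlow reports it was built against differs from the toolkit on your PATH, that mismatch is the first thing I would fix.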