TensorFlow stops working after adding a second GPU (CUDNN_STATUS_INTERNAL_ERROR)



Today I proudly installed a second RTX 2070 in my machine to speed up TensorFlow 2.2 even further. Rather disappointingly, a Python script that runs fine on one GPU no longer works. I tried to boil it down to a minimal working example, which runs correctly with the line

strategy = tf.distribute.MirroredStrategy(devices=["/cpu:0"])

If I replace this line with any of the following,

strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
strategy = tf.distribute.MirroredStrategy()
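
(A quick sanity check along these lines, just a sketch using the standard tf.config API, lists the GPUs TensorFlow can see; both cards do appear in the device-creation logs further down.)

import tensorflow as tf

# List the physical GPUs TensorFlow has detected
gpus = tf.config.list_physical_devices('GPU')
print("Number of GPUs visible to TensorFlow:", len(gpus))
for gpu in gpus:
    print(gpu)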

I get an error message like this:

Starting training
Epoch 1/5
2020-05-23 22:52:59.205856: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-05-23 22:52:59.400434: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-05-23 22:53:00.881437: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-05-23 22:53:00.898484: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "example3.py", line 77, in <module>
    main()
  File "example3.py", line 70, in main
    model.fit(x=training_generator, workers=1, epochs=5, steps_per_epoch = len(training_generator))
  File "/home/frank/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 66, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/home/frank/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 848, in fit
    tmp_logs = train_function(iterator)
  File "/home/frank/.local/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 580, in __call__
    result = self._call(*args, **kwds)
  File "/home/frank/.local/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 644, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/frank/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 2420, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/home/frank/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1665, in _filtered_call
    self.captured_inputs)
  File "/home/frank/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1746, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/frank/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 598, in call
    ctx=ctx)
  File "/home/frank/.local/lib/python3.6/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node sequential/conv2d/Conv2D (defined at /usr/lib64/python3.6/threading.py:916) ]]
(1) Unknown:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node sequential/conv2d/Conv2D (defined at /usr/lib64/python3.6/threading.py:916) ]]
[[div_no_nan_1/ReadVariableOp/_14]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_1034]
Errors may have originated from an input operation.
Input Source operations connected to node sequential/conv2d/Conv2D:
cond_1/Identity (defined at example3.py:70)
Input Source operations connected to node sequential/conv2d/Conv2D:
cond_1/Identity (defined at example3.py:70)
Function call stack:
train_function -> train_function
2020-05-23 22:53:00.943669: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.
[[{{node PyFunc}}]]

Here is the complete code to reproduce the error:

import tensorflow as tf
import numpy as np
from PIL import Image, ImageDraw


class DataGenerator(tf.keras.utils.Sequence):
    def __init__(self, BatchSize, PicX, PicY, Color):
        self._BatchSize = BatchSize
        self._dim = (PicX, PicY)
        self._Color = Color

    def __len__(self):
        return 100

    def create_random_form(self):
        img = Image.new('RGB', self._dim, (50,50,50))
        draw = ImageDraw.Draw(img)
        label = np.random.randint(3)
        x0 = np.random.randint(int((self._dim[0]-5)/2))+1
        x1 = np.random.randint(int((self._dim[0]-5)/2))+int(self._dim[0]/2)
        y0 = np.random.randint(int((self._dim[1]-5)/2))
        y1 = np.random.randint(int((self._dim[1]-5)/2))+int(self._dim[1]/2)
        if label == 0:
            draw.rectangle((x0,y0,x1,y1), fill=self._Color)
        elif label == 1:
            draw.ellipse((x0,y0,x1,y1), fill=self._Color)
        else:
            draw.polygon([(x0,y0),(x0,y1),(x1,y1)], fill=self._Color)
        return img, label

    def __getitem__(self, index):
        X = np.empty((self._BatchSize, *self._dim, 3))
        y = np.empty((self._BatchSize), dtype=int)
        for i in range(0, self._BatchSize):
            img, label = self.create_random_form()
            X[i,] = tf.keras.preprocessing.image.img_to_array(img) / 255.0
            y[i] = label
        return X, y


def main():
    PicX = 300
    PicY = 300
    Color = (255,255,255)
    #save_some_pics(20)
    print("Starting a minimal, self-contained error reproduction")
    #strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
    #strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
    strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0"])
    #strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")
    with strategy.scope():
        model = tf.keras.Sequential()
        model.add(tf.keras.layers.Conv2D(32, (9,9), activation='relu', input_shape=(PicX, PicY, 3)))
        model.add(tf.keras.layers.MaxPooling2D((9,9)))
        model.add(tf.keras.layers.Conv2D(64, (9,9), activation='relu'))
        model.add(tf.keras.layers.MaxPooling2D((9,9)))
        model.add(tf.keras.layers.Flatten())
        model.add(tf.keras.layers.Dense(64, activation='relu'))
        model.add(tf.keras.layers.Dense(3, activation='softmax'))
        model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    print(model.summary())
    training_generator = DataGenerator(10, PicX, PicY, Color)
    print("Starting training")
    model.fit(x=training_generator, workers=1, epochs=5, steps_per_epoch=len(training_generator))
    test_generator = DataGenerator(10, PicX, PicY, Color)
    test_loss, test_acc = model.evaluate(test_generator)
    print("Test loss {}, test accuracy {}".format(test_loss, test_acc))


if __name__ == '__main__':
    main()

Running it on the CPU works smoothly, just as it did when there was only one GPU in the machine, and so does

strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")

With that, training starts normally:

2020-05-23 23:11:56.890690: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-23 23:11:56.891333: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-23 23:11:56.891912: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-23 23:11:56.892554: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7377 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-05-23 23:11:56.892873: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-23 23:11:56.893483: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 7377 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2070, pci bus id: 0000:02:00.0, compute capability: 7.5)
Starting training
Epoch 1/5
2020-05-23 23:11:58.036789: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
100/100 [==============================] - 44s 438ms/step - loss: 8.8931 - accuracy: 0.4841
Epoch 2/5
100/100 [==============================] - 44s 437ms/step - loss: 0.8959 - accuracy: 0.6444

I am out of ideas about what else to try, and googling the error message did not turn up much - any ideas are highly appreciated!

After some more experimenting and reading, I found the solution hidden here.

Thanks Srihari Humbarwadi and 墨水樱桃!

The solution is to enable memory growth on all GPUs, like this:

physical_devices = tf.config.list_physical_devices('GPU')
for p in physical_devices:
    tf.config.experimental.set_memory_growth(p, True)
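
For this to work, the flags have to be set right at the start, before the MirroredStrategy (or anything else) initializes the GPUs; otherwise set_memory_growth raises a RuntimeError. A minimal sketch of how it fits into the script above (the helper name is just for illustration):

import tensorflow as tf

def enable_memory_growth():
    # Must run before any GPU has been initialized, i.e. before creating
    # a distribution strategy or building a model on the GPU.
    for gpu in tf.config.list_physical_devices('GPU'):
        tf.config.experimental.set_memory_growth(gpu, True)

def main():
    enable_memory_growth()
    strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
    with strategy.scope():
        ...  # build and compile the model as in the example above

if __name__ == '__main__':
    main()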
