I have a new computer (running Ubuntu 18.04) with a 2080Ti GPU. I am trying to train a neural network in Python using Keras (in an Anaconda environment), but I get a "Segmentation fault (core dumped)" error when I try to fit the model.
The code I am using runs perfectly fine on my Windows PC (which has a 1080Ti GPU). The error seems to be related to GPU memory: when I run nvidia-smi before fitting the model, I can see something odd happening. In the processes section I can see that it is related to the Anaconda environment (i.e. ...ics-link/anaconda3/envs/py35/bin/python = 9677MiB)
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.25 Driver Version: 415.25 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... On | 00000000:04:00.0 On | N/A |
| 28% 44C P2 51W / 250W | 10491MiB / 10986MiB | 7% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1507 G /usr/lib/xorg/Xorg 30MiB |
| 0 1538 G /usr/bin/gnome-shell 57MiB |
| 0 1844 G /usr/lib/xorg/Xorg 309MiB |
| 0 1979 G /usr/bin/gnome-shell 177MiB |
| 0 3816 G /usr/lib/firefox/firefox 6MiB |
| 0 5451 G ...-token=169F1B80118E535BC5002C22A81DD0FA 90MiB |
| 0 5896 G ...-token=631C5DCD90ADCF80959770937CE797E7 128MiB |
| 0 6485 C ...ics-link/anaconda3/envs/py35/bin/python 9677MiB |
+-----------------------------------------------------------------------------+
Here is the code, for reference:
from __future__ import print_function
import keras
from keras.datasets import cifar10
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D, Activation, BatchNormalization
from keras.callbacks import ModelCheckpoint, CSVLogger
from keras import backend as K
import numpy as np
batch_size = 64
num_classes = 10
epochs = 10
# input image dimensions
img_rows, img_cols = 32, 32
# the data, shuffled and split between train and test sets
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 3, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 3, img_rows, img_cols)
    input_shape = (3, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 3)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 3)
    input_shape = (img_rows, img_cols, 3)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
# normalise pixel values
mean = np.mean(x_train,axis=(0,1,2,3))
std = np.std(x_train,axis=(0,1,2,3))
x_train = (x_train-mean)/(std+1e-7)
x_test = (x_test-mean)/(std+1e-7)
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')
# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(Conv2D(64, (3, 3)))
#model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(128, (3, 3)))
#model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(256, (3, 3)))
#model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(1024))
model.add(Activation("relu"))
model.add(Dropout(0.25))
model.add(Dense(1024))
model.add(Activation("relu"))
model.add(Dropout(0.25))
model.add(Dense(1024))
model.add(Activation("relu"))
model.add(Dropout(0.25))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])
#load weights from previous run
#model.load_weights('model07_weights_best.hdf5')
from keras.preprocessing.image import ImageDataGenerator
datagen = ImageDataGenerator(
    featurewise_center=False,  # set input mean to 0 over the dataset
    samplewise_center=False,  # set each sample mean to 0
    featurewise_std_normalization=False,  # divide inputs by std of the dataset
    samplewise_std_normalization=False,  # divide each input by its std
    zca_whitening=False,  # apply ZCA whitening
    rotation_range=0.1,  # randomly rotate images in the range (degrees, 0 to 180)
    width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
    height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
    horizontal_flip=True,  # randomly flip images
    vertical_flip=False)  # randomly flip images
# Compute quantities required for feature-wise normalization
# (std, mean, and principal components if ZCA whitening is applied).
datagen.fit(x_train)
#save weights and log
checkpoint = ModelCheckpoint("model14_weights_best.hdf5", monitor='val_acc', verbose=1, save_best_only=True, mode='max')
csv_logger = CSVLogger('model14_loss_log.csv', append=True, separator=';')
callbacks_list = [checkpoint,csv_logger]
# Fit the model on the batches generated by datagen.flow().
model.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size),
                    epochs=epochs,
                    validation_data=(x_test, y_test),
                    callbacks=callbacks_list)
I would not expect anything to take up much space on the GPU, yet it appears to be saturated. As I mentioned, the same code works on my Windows PC.
Any ideas what could be causing this?
I don't think this is related to memory size. I have been dealing with this issue recently. A segmentation fault indicates that the parallelization of the training process on the GPU has failed. If the process ran sequentially, this error would not occur, no matter how large the dataset is. There is also no need to worry about your deep learning setup.
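One quick way to test whether the crash really comes from the GPU code path (my own suggestion, not part of the original answer) is to hide the GPU from TensorFlow and run the same script on the CPU; CUDA_VISIBLE_DEVICES is the standard mechanism for this:

# Hide all GPUs from TensorFlow before keras/tensorflow are imported,
# so the same training script runs on the CPU only. If the segmentation
# fault disappears, the problem lies in the GPU path (driver/CUDA/cuDNN),
# not in the model or the data.
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

import keras  # import only after the environment variable is set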
Since you are just setting up a new machine, I believe there are two likely reasons for the segmentation fault in your context.
First, I would check whether the GPU is installed correctly, but based on the details you have provided, I think the problem is more likely the second reason, the module (Keras in your case):
- In this case, you may have run into something odd when installing the module or its dependencies. I suggest removing it, cleaning everything up, and reinstalling.
- Are you sure your tensorflow-gpu installation is correct? What about CUDA and cuDNN?
If you think Keras is installed correctly, try the following test code:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
This will print whether your TensorFlow is using the CPU or the GPU backend.
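As an extra check (my own addition, assuming a TensorFlow 1.x installation like the one in the question), you can also ask TensorFlow directly whether it can see a GPU:

import tensorflow as tf

# True only if TensorFlow was built with CUDA support and can actually
# initialize a GPU device on this machine.
print(tf.test.is_built_with_cuda())
print(tf.test.is_gpu_available())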
If all of the above checks pass, I doubt you will see the segmentation fault again.
Please check this reference for testing TensorFlow on the GPU.
If this is a memory problem, you can train with a lower batch size. Try reducing the batch size to 32; if that does not work, keep reducing it until the batch size is 1, and observe the GPU usage.
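For example (a minimal sketch reusing the model and data generator from the question; the batch size value is only illustrative), run a single epoch with a smaller batch and watch nvidia-smi while it trains:

# Try a single epoch with a smaller batch size and watch the GPU memory
# column in nvidia-smi while it runs; keep halving the value if it still crashes.
small_batch_size = 32
model.fit_generator(datagen.flow(x_train, y_train, batch_size=small_batch_size),
                    epochs=1,
                    validation_data=(x_test, y_test))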
In addition, add the following code at the top of your script; it allocates GPU memory dynamically, so you can see how much GPU memory is actually used/needed with the smaller batch sizes.
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto()
config.gpu_options.allow_growth = True # dynamically grow the memory used on the GPU
config.log_device_placement = True # to log device placement (on which device the operation ran)
# (nothing gets printed in Jupyter, only if you run it standalone)
sess = tf.Session(config=config)
set_session(sess) # set this TensorFlow session as the default session for Keras
Source: https://github.com/keras-team/keras/issues/4161
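If you prefer a hard cap instead of dynamic growth, TensorFlow 1.x also lets you limit the fraction of GPU memory a session may claim; this is an alternative I am adding for reference (the 0.5 value is just an example), not part of the original answer:

import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

config = tf.ConfigProto()
# Allow this process to use at most ~50% of the GPU's memory.
config.gpu_options.per_process_gpu_memory_fraction = 0.5
set_session(tf.Session(config=config))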
I hope this helps.