I keep getting OOM errors on system memory (not GPU memory), but I'm not sure which function is causing TensorFlow to load everything into RAM. A month ago I ran an image classifier on a different dataset and copied that code over with only minor changes. So there are two differences from the previous dataset that could cause the OOM: 1) the images are much larger, but I resize them to 224x224 early on, so I don't think that should matter at runtime; 2) the dataset is twice the size, but this time I'm not using cache or shuffle, so I don't see why anything more than a batch's worth should be loaded into memory.
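To narrow down which step actually inflates resident memory, one option (a debugging sketch, assuming the psutil package is installed; the log_rss helper name is my own) is to log the process RSS before and after each pipeline stage:

import os
import psutil

def log_rss(tag):
    # Print the current process's resident set size in MiB.
    rss = psutil.Process(os.getpid()).memory_info().rss
    print(f"{tag}: {rss / 1024 ** 2:.0f} MiB")

log_rss("before dataset construction")
# ... build the dataset, split it, batch it, calling log_rss after each step ...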
import tensorflow as tf
from tensorflow.keras.applications.resnet50 import preprocess_input

def read_and_decode(filename, label):
    # Read the entire contents of the input file as a byte-string tensor.
    img = tf.io.read_file(filename)
    # Decode the raw JPEG bytes into a 3-channel (RGB) uint8 pixel tensor.
    img = tf.io.decode_jpeg(img, channels=3)
    # Resize (with padding) to 224x224.
    img = tf.image.resize_with_pad(
        img,
        224,
        224,
        method=tf.image.ResizeMethod.BILINEAR,
        antialias=False
    )
    # Apply the ResNet50 preprocessing expected by the ImageNet weights.
    img = preprocess_input(img)
    return img, label
ds_oh = tf.data.Dataset.from_tensor_slices((img_paths, oh_input))
ds_oh = ds_oh.map(read_and_decode)
All the data is now in ds_oh, sized 224x224 with the correct labels.
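If decoding is lazy as intended, building ds_oh should cost almost nothing; only iterating it should trigger file reads. A quick sanity check (a sketch using only standard tf.data calls):

print(ds_oh.element_spec)  # the image TensorSpec should show shape (224, 224, 3)
for img, label in ds_oh.take(1):
    # Only this iteration should actually read and decode a file.
    print(img.shape, label.shape)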
def ds_split(ds, ds_size, shuffle_size, train_split=0.8, val_split=0.2, shuffle=True):
    assert (train_split + val_split) == 1
    if shuffle:
        ds = ds.shuffle(shuffle_size, seed=99)
    train_size = int(train_split * ds_size)
    val_size = int(val_split * ds_size)
    train_ds = ds.take(train_size)
    val_ds = ds.skip(train_size).take(val_size)
    return train_ds, val_ds
train_ds, val_ds = ds_split(ds_oh, len(img_paths), len(img_paths), train_split=0.8, val_split=0.2, shuffle=True)
Split into training and validation datasets, shuffled.
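Worth noting as an aside (a known tf.data pitfall, not something stated in the original post): shuffle reshuffles on every iteration by default, so a take/skip split applied after shuffle can mix elements between the train and validation halves across epochs. A sketch of the safer variant of this split:

ds = ds.shuffle(shuffle_size, seed=99, reshuffle_each_iteration=False)
train_ds = ds.take(train_size)
val_ds = ds.skip(train_size).take(val_size)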
#One hot
#train_ds = train_ds.cache()
#train_ds = train_ds.shuffle(buffer_size=len(img_paths), reshuffle_each_iteration=True)
train_ds = train_ds.batch(BATCH_SIZE)
train_ds = train_ds.prefetch(tf.data.AUTOTUNE)
#val_ds = val_ds.cache()
val_ds = val_ds.batch(BATCH_SIZE)
val_ds = val_ds.prefetch(tf.data.AUTOTUNE)
Batched and prefetched; removed cache and shuffle because of the OOM errors.
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import MaxPool2D, Flatten, Dense, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD

# input layer
inputs = tf.keras.Input(shape=(224, 224, 3))
base_model = ResNet50(weights="imagenet", include_top=False, input_shape=(224, 224, 3))(inputs)
# creating our new model head to combine with the ResNet base model
head_model = MaxPool2D(pool_size=(4, 4))(base_model)
head_model = Flatten(name='flatten')(head_model)
head_model = Dense(1024, activation='relu')(head_model)
head_model = Dropout(0.2)(head_model)
head_model = Dense(512, activation='relu')(head_model)
head_model = Dropout(0.2)(head_model)
head_model = Dense(29, activation='softmax')(head_model)
# final configuration
model = Model(inputs, head_model)
# freeze the ResNet50 base (layer index 1; index 0 is the InputLayer)
model.layers[1].trainable = False
optimizer = SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss="categorical_crossentropy", optimizer=optimizer, metrics=['accuracy'])
Model construction.
INITIAL_EPOCHS = 35
history = model.fit(train_ds,
epochs=INITIAL_EPOCHS,
validation_data=val_ds)
Epoch 1/35
It fails before the first epoch completes.
For anyone wondering: the problem was in how I split my tf.data.Dataset. The ds_split function I found online was causing a memory leak for some reason; even ds.take(1) triggered an OOM error. I found a similar function online and tried that one as well, and hit the same problem.
I decided to use scikit-learn's train_test_split on the lists of image file paths and labels instead, and to build two tf.data.Datasets from the results. Everything seems to work now.
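A minimal sketch of that workaround (assuming the same img_paths, oh_input, read_and_decode, and BATCH_SIZE as above; the split variable names are mine):

from sklearn.model_selection import train_test_split

# Split the plain Python lists of paths and one-hot labels up front.
train_paths, val_paths, train_labels, val_labels = train_test_split(
    img_paths, oh_input, test_size=0.2, random_state=99)

# Build two independent datasets; images are still decoded lazily, per element.
train_ds = (tf.data.Dataset.from_tensor_slices((train_paths, train_labels))
            .map(read_and_decode)
            .batch(BATCH_SIZE)
            .prefetch(tf.data.AUTOTUNE))
val_ds = (tf.data.Dataset.from_tensor_slices((val_paths, val_labels))
          .map(read_and_decode)
          .batch(BATCH_SIZE)
          .prefetch(tf.data.AUTOTUNE))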