I keep getting OOM errors on system memory (not GPU memory), but I'm not sure which function is causing TensorFlow to load everything into RAM. A month ago I ran an image classifier on a different dataset and copied that code over with only minor changes. So there are two differences from the previous dataset that could cause the OOM: 1) the images are much larger, but I resize them to 224x224 early on, so I don't think that should matter at runtime; 2) the dataset is twice the size, but this time I'm not using cache or shuffle, so I don't see why anything more than a batch's worth should be loaded into memory.
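To narrow down which step actually inflates resident memory, one option (a debugging sketch, assuming the psutil package is installed; the log_rss helper name is my own) is to log the process RSS before and after each pipeline stage:

import os
import psutil

def log_rss(tag):
    # Print the current process's resident set size in MiB.
    rss = psutil.Process(os.getpid()).memory_info().rss
    print(f"{tag}: {rss / 1024 ** 2:.0f} MiB")

log_rss("before dataset construction")
# ... build the dataset, split it, batch it, calling log_rss after each step ...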
import tensorflow as tf
from tensorflow.keras.applications.resnet50 import preprocess_input

def read_and_decode(filename, label):
    # Read the entire contents of the input file as a byte-string tensor.
    img = tf.io.read_file(filename)
    # Decode the raw JPEG bytes into a 3-channel (RGB) uint8 pixel tensor.
    img = tf.io.decode_jpeg(img, channels=3)
    # Resize (with padding) to 224x224.
    img = tf.image.resize_with_pad(
        img,
        224,
        224,
        method=tf.image.ResizeMethod.BILINEAR,
        antialias=False
    )
    # Apply the ResNet50 preprocessing expected by the ImageNet weights.
    img = preprocess_input(img)
    return img, label
ds_oh = tf.data.Dataset.from_tensor_slices((img_paths, oh_input))
ds_oh = ds_oh.map(read_and_decode)
All the data is now in ds_oh, sized 224x224 with the correct labels.
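If decoding is lazy as intended, building ds_oh should cost almost nothing; only iterating it should trigger file reads. A quick sanity check (a sketch using only standard tf.data calls):

print(ds_oh.element_spec)  # the image TensorSpec should show shape (224, 224, 3)
for img, label in ds_oh.take(1):
    # Only this iteration should actually read and decode a file.
    print(img.shape, label.shape)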
def ds_split(ds, ds_size, shuffle_size, train_split=0.8, val_split=0.2, shuffle=True):
    assert (train_split + val_split) == 1
    if shuffle:
        ds = ds.shuffle(shuffle_size, seed=99)
    train_size = int(train_split * ds_size)
    val_size = int(val_split * ds_size)
    train_ds = ds.take(train_size)
    val_ds = ds.skip(train_size).take(val_size)
    return train_ds, val_ds
train_ds, val_ds = ds_split(ds_oh, len(img_paths), len(img_paths), train_split=0.8, val_split=0.2, shuffle=True)
Split into training and validation datasets, shuffled.
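Worth noting as an aside (a known tf.data pitfall, not something stated in the original post): shuffle reshuffles on every iteration by default, so a take/skip split applied after shuffle can mix elements between the train and validation halves across epochs. A sketch of the safer variant of this split:

ds = ds.shuffle(shuffle_size, seed=99, reshuffle_each_iteration=False)
train_ds = ds.take(train_size)
val_ds = ds.skip(train_size).take(val_size)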
#One hot
#train_ds = train_ds.cache()
#train_ds = train_ds.shuffle(buffer_size=len(img_paths), reshuffle_each_iteration=True)
train_ds = train_ds.batch(BATCH_SIZE)
train_ds = train_ds.prefetch(tf.data.AUTOTUNE)
#val_ds = val_ds.cache()
val_ds = val_ds.batch(BATCH_SIZE)
val_ds = val_ds.prefetch(tf.data.AUTOTUNE)
Batched and prefetched; removed cache and shuffle because of the OOM errors.
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import MaxPool2D, Flatten, Dense, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD

# input layer
inputs = tf.keras.Input(shape=(224, 224, 3))
base_model = ResNet50(weights="imagenet", include_top=False, input_shape=(224, 224, 3))(inputs)
# creating our new model head to combine with the ResNet base model
head_model = MaxPool2D(pool_size=(4, 4))(base_model)
head_model = Flatten(name='flatten')(head_model)
head_model = Dense(1024, activation='relu')(head_model)
head_model = Dropout(0.2)(head_model)
head_model = Dense(512, activation='relu')(head_model)
head_model = Dropout(0.2)(head_model)
head_model = Dense(29, activation='softmax')(head_model)
# final configuration
model = Model(inputs, head_model)
# freeze the ResNet50 base (layer index 1; index 0 is the InputLayer)
model.layers[1].trainable = False
optimizer = SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss="categorical_crossentropy", optimizer=optimizer, metrics=['accuracy'])
Model construction.
INITIAL_EPOCHS = 35
history = model.fit(train_ds,
epochs=INITIAL_EPOCHS,
validation_data=val_ds)
Epoch 1/35
It fails before the first epoch completes.
For anyone wondering: the problem was in how I split my tf.data.Dataset. The ds_split function I found online was causing a memory leak for some reason; even ds.take(1) triggered an OOM error. I found a similar function online and tried that one as well, and hit the same problem.
I decided to use scikit-learn's train_test_split on the lists of image file paths and labels instead, and to build two tf.data.Datasets from the results. Everything seems to work now.
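A minimal sketch of that workaround (assuming the same img_paths, oh_input, read_and_decode, and BATCH_SIZE as above; the split variable names are mine):

from sklearn.model_selection import train_test_split

# Split the plain Python lists of paths and one-hot labels up front.
train_paths, val_paths, train_labels, val_labels = train_test_split(
    img_paths, oh_input, test_size=0.2, random_state=99)

# Build two independent datasets; images are still decoded lazily, per element.
train_ds = (tf.data.Dataset.from_tensor_slices((train_paths, train_labels))
            .map(read_and_decode)
            .batch(BATCH_SIZE)
            .prefetch(tf.data.AUTOTUNE))
val_ds = (tf.data.Dataset.from_tensor_slices((val_paths, val_labels))
          .map(read_and_decode)
          .batch(BATCH_SIZE)
          .prefetch(tf.data.AUTOTUNE))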