I am currently working on a big-data problem while training on image data with Keras. I have a directory containing a set of .npy batch files; each batch holds 512 images, and each image file has a corresponding .npy label file. So it looks like: {image_file_1.npy, label_file_1.npy, ..., image_file_37.npy, label_file_37.npy}. Each image file has shape (512, 199, 199, 3) and each label file has shape (512, 1) (values 1 or 0). Loading all the images into one ndarray would take 35+ GB. I have read all the Keras docs so far, but I still cannot figure out how to train with a custom generator. I have read about flow_from_directory and ImageDataGenerator(...).flow(), but they are not ideal in this case, or I don't know how to customise them. Here is what I have done:
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.optimizers import SGD
from keras.preprocessing.image import ImageDataGenerator
val_gen = ImageDataGenerator(rescale=1./255)
x_test = np.load("../data/val_file.npy")
y_test = np.load("../data/val_label.npy")
val_gen.fit(x_test)
model = Sequential()
...
model.add(Dense(512, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
sgd = SGD()
model.compile(loss='binary_crossentropy',  # single sigmoid unit -> binary loss
              optimizer=sgd,
              metrics=['acc'])
model.fit_generator(generate_batch_from_directory(),  # should give 1 image file and 1 label file
                    validation_data=val_gen.flow(x_test,
                                                 y_test,
                                                 batch_size=64),
                    validation_steps=32)
So here generate_batch_from_directory() should pick up image_file_i.npy and label_file_i.npy on each step and optimise the weights until no batches are left. Each image array in the .npy files has already been augmented, rotated and scaled, and each .npy file is properly mixed with data from class 1 and class 0 (50/50).
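Roughly, what I have in mind for generate_batch_from_directory() is a sketch like the following (the directory path and file-name pattern are placeholders matching my layout; this is the idea, not tested code):

```python
import os
import numpy as np

def generate_batch_from_directory(data_dir="../data/train/"):
    """Yield one (images, labels) pair per .npy batch file, looping forever,
    since Keras' fit_generator expects an endless generator.
    Files follow the image_file_i.npy / label_file_i.npy naming pattern.
    """
    image_files = sorted(
        f for f in os.listdir(data_dir) if f.startswith("image_file"))
    while True:
        for image_file in image_files:
            label_file = image_file.replace("image_file", "label_file")
            # one .npy file is one batch of 512 images
            X = np.load(os.path.join(data_dir, image_file)) / 255.0  # rescale
            y = np.load(os.path.join(data_dir, label_file))
            yield X, y
```

With this, steps_per_epoch would be the number of image files (37 in my case).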
If I concatenate all the batches into one big array, for example:
X_train = np.concatenate([image_file_1, ..., image_file_37])
y_train = np.concatenate([label_file_1, ..., label_file_37])
it does not fit in memory. Otherwise I could use .flow() to generate image batches and train the model.
Thanks for any advice.
I finally solved the problem, but I had to go through the source code and documentation of keras.utils.Sequence to build my own generator class. That documentation is very helpful for understanding how Keras generators work. You can read more details in my kaggle notebook:
import os
import numpy as np
import keras

all_files_loc = "datapsycho/imglake/population/train/image_files/"
all_files = os.listdir(all_files_loc)

image_label_map = {
    "image_file_{}.npy".format(i+1): "label_file_{}.npy".format(i+1)
    for i in range(int(len(all_files)/2))}
partition = [item for item in all_files if "image_file" in item]
class DataGenerator(keras.utils.Sequence):

    def __init__(self, file_list):
        """Constructor can be expanded
        with batch size, dimension etc.
        """
        self.file_list = file_list
        self.on_epoch_end()

    def __len__(self):
        'Number of batches per epoch: one .npy file is one batch'
        return int(len(self.file_list))

    def __getitem__(self, index):
        'Get next batch'
        # Generate indexes of the batch
        indexes = self.indexes[index:(index + 1)]

        # single file
        file_list_temp = [self.file_list[k] for k in indexes]

        # Set of X_train and y_train
        X, y = self.__data_generation(file_list_temp)

        return X, y

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.file_list))

    def __data_generation(self, file_list_temp):
        'Loads one image file and its matching label file'
        data_loc = "datapsycho/imglake/population/train/image_files/"
        # file_list_temp holds exactly one file name
        for ID in file_list_temp:
            x_file_path = os.path.join(data_loc, ID)
            y_file_path = os.path.join(data_loc, image_label_map.get(ID))

            # Load the batch of images
            X = np.load(x_file_path)
            # Load the matching labels
            y = np.load(y_file_path)

        return X, y
# ====================
# train set
# ====================
all_files_loc = "datapsycho/imglake/population/train/image_files/"
all_files = os.listdir(all_files_loc)

training_generator = DataGenerator(partition)
validation_generator = ValDataGenerator(val_partition)  # works the same as the training generator

hst = model.fit_generator(generator=training_generator,
                          epochs=200,
                          validation_data=validation_generator,
                          use_multiprocessing=True,
                          max_queue_size=32)
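The Sequence protocol only requires __len__ and __getitem__, so the train and validation generators can share one class that takes the data directory as a constructor argument. A framework-free sketch of that idea (class name, shuffle flag and directory argument are mine; in the real code the class subclasses keras.utils.Sequence):

```python
import os
import numpy as np

class NpyBatchSequence:
    """Sketch of the Sequence protocol: one .npy file == one batch.

    Passing the directory in the constructor lets the same class serve
    both the training and the validation set; shuffling the file order
    in on_epoch_end varies the batch order between epochs.
    """
    def __init__(self, data_loc, shuffle=True):
        self.data_loc = data_loc
        self.file_list = sorted(
            f for f in os.listdir(data_loc) if "image_file" in f)
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        # number of batches per epoch
        return len(self.file_list)

    def __getitem__(self, index):
        image_file = self.file_list[self.indexes[index]]
        label_file = image_file.replace("image_file", "label_file")
        X = np.load(os.path.join(self.data_loc, image_file))
        y = np.load(os.path.join(self.data_loc, label_file))
        return X, y

    def on_epoch_end(self):
        # reshuffle the file order between epochs (disable for validation)
        self.indexes = np.arange(len(self.file_list))
        if self.shuffle:
            np.random.shuffle(self.indexes)
```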