Large HDF5 dataset: how to shuffle efficiently after each epoch?



I am currently training a CNN (Keras/TensorFlow) on a large image dataset (~60 GB) for a simple classification task. The images are video frames and therefore highly correlated in time, so I already shuffled the data thoroughly once when generating the huge .hdf5 file. To feed the data into the CNN without loading the whole set into memory at once, I wrote a simple batch generator (see the code below).

Now my question: it is usually recommended to shuffle the data after every training epoch (for SGD convergence reasons?). But to do that I would have to load the entire dataset and shuffle it each epoch, which is exactly what I wanted to avoid by using the batch generator... So: does shuffling after each epoch really matter, and if so, how can I do it efficiently?

Here is the current code of my batch generator:

import numpy as np
from keras.utils import to_categorical


def generate_batches_from_hdf5_file(hdf5_file, batch_size, dimensions, num_classes):
    """
    Generator that yields batches of images ('xs') and labels ('ys') from an h5 file.
    """
    filesize = len(hdf5_file['labels'])
    while 1:
        # count how many entries we have read
        n_entries = 0
        # as long as we haven't read all entries from the file: keep reading
        while n_entries + batch_size <= filesize:
            # read the next batch of input data (features) as a numpy array
            xs = hdf5_file['images'][n_entries: n_entries + batch_size]
            xs = np.reshape(xs, dimensions).astype('float32')
            # and the label info. Contains more than one label in my case, e.g. is_dog, is_cat, fur_color, ...
            y_values = hdf5_file['labels'][n_entries: n_entries + batch_size]
            ys = to_categorical(y_values, num_classes)
            # we have read one more batch from this file
            n_entries += batch_size
            yield (xs, ys)
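
For context, this is roughly how I hook the generator into training. The file name, model, and epoch count below are placeholders, not part of my actual setup:

import h5py

f = h5py.File('train_data.hdf5', 'r')          # placeholder file name
batch_size = 32
steps = len(f['labels']) // batch_size         # number of batches per epoch

model.fit_generator(
    generate_batches_from_hdf5_file(f, batch_size, dimensions, num_classes),
    steps_per_epoch=steps,
    epochs=10)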

Yes, shuffling improves performance.

Don't shuffle the whole data. Create a list of indices into the data and shuffle that instead. Then move sequentially over the index list and use its values to pick the samples from the dataset.
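
A minimal sketch of that index-shuffling approach, reusing the 'images'/'labels' layout from the question. The generator name and the per-batch sorting (h5py fancy indexing requires indices in increasing order) are illustrative choices, not prescribed by the answer:

import numpy as np
from keras.utils import to_categorical


def generate_shuffled_batches(hdf5_file, batch_size, dimensions, num_classes):
    """
    Like the generator above, but reshuffles an index list at the start of
    each epoch instead of shuffling the data on disk.
    """
    filesize = len(hdf5_file['labels'])
    while 1:
        # new random order of sample indices for this epoch
        indices = np.arange(filesize)
        np.random.shuffle(indices)
        for start in range(0, filesize - batch_size + 1, batch_size):
            # h5py fancy indexing needs increasing indices, so sort the
            # slice of shuffled indices before reading from the file
            batch_idx = np.sort(indices[start:start + batch_size])
            xs = hdf5_file['images'][batch_idx]
            xs = np.reshape(xs, dimensions).astype('float32')
            y_values = hdf5_file['labels'][batch_idx]
            ys = to_categorical(y_values, num_classes)
            yield (xs, ys)

Sorting within a batch only affects the order inside that batch; the assignment of samples to batches is still reshuffled every epoch, which is what matters for SGD.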
