Tensorflow-IO Dataset input pipeline with very large HDF5 files

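I have very large training files (30 GB).
Since the data does not all fit in my available RAM, I want to read it in batches.
I saw that the Tensorflow-io package implements a way to read HDF5 into TensorFlow like this, through the function tfio.IODataset.from_hdf5().
Then, since tf.keras.Model.fit() takes a tf.data.Dataset containing both samples and targets as input, I need to zip my X and Y together and then use .batch and .prefetch to load only the necessary data into memory.
For testing, I applied this method to smaller samples: training (9 GB), validation (2.5 GB) and test (1.2 GB). I know these work well, because they fit into memory and give me good results when loaded that way (70% accuracy and a loss below 1).
The training data is stored in HDF5 files, split into sample (X) and label (Y) files, like so: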

X_learn.hdf5  
X_val.hdf5  
X_test.hdf5  
Y_test.hdf5  
Y_learn.hdf5  
Y_val.hdf5

Here is my code:

import tensorflow as tf
import tensorflow_io as tfio
from tensorflow.keras.optimizers import Adam

# path_hdf5_*, build_model, lr_schedule, class_weights_train and callbacks
# are defined elsewhere in the script.
BATCH_SIZE = 2048
EPOCHS = 100

# Create an IODataset from each HDF5 file's dataset object
x_val = tfio.IODataset.from_hdf5(path_hdf5_x_val, dataset='/X_val')
y_val = tfio.IODataset.from_hdf5(path_hdf5_y_val, dataset='/Y_val')
x_test = tfio.IODataset.from_hdf5(path_hdf5_x_test, dataset='/X_test')
y_test = tfio.IODataset.from_hdf5(path_hdf5_y_test, dataset='/Y_test')
x_train = tfio.IODataset.from_hdf5(path_hdf5_x_train, dataset='/X_learn')
y_train = tfio.IODataset.from_hdf5(path_hdf5_y_train, dataset='/Y_learn')

# Zip together samples and corresponding labels, then batch and prefetch
train = tf.data.Dataset.zip((x_train, y_train)).batch(BATCH_SIZE, drop_remainder=True).prefetch(tf.data.experimental.AUTOTUNE)
test = tf.data.Dataset.zip((x_test, y_test)).batch(BATCH_SIZE, drop_remainder=True).prefetch(tf.data.experimental.AUTOTUNE)
val = tf.data.Dataset.zip((x_val, y_val)).batch(BATCH_SIZE, drop_remainder=True).prefetch(tf.data.experimental.AUTOTUNE)

# Build the model
model = build_model()

# Compile the model with a custom learning rate function for the Adam optimizer
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(learning_rate=lr_schedule(0)),
              metrics=['accuracy'])

# Fit the model with class weights calculated beforehand
# (Keras ignores shuffle=True when the input is a tf.data.Dataset)
model.fit(train,
          epochs=EPOCHS,
          class_weight=class_weights_train,
          validation_data=val,
          shuffle=True,
          callbacks=callbacks)

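This code runs, but the loss goes very high (300+) and the accuracy drops to 0 right from the start (0.30 -> 4*e^-5)... I don't understand what I am doing wrong. Am I missing something?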

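Providing the solution here (Answer section), even though it is already present in the Comment section, for the benefit of the community.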

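There was no issue with the code. The problem was in the data itself (it was not preprocessed properly), so the model could not learn well, which produced the strange loss and accuracy values.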
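The thread does not say which preprocessing step was missing, so the following is only a minimal sketch of how a few batches could be inspected before training. It assumes the features should be finite and roughly normalized, and that the labels should be one-hot encoded, as categorical_crossentropy expects. The helper name summarize_batches is hypothetical and is not part of TensorFlow or tensorflow-io.

```python
import numpy as np


def summarize_batches(batches):
    """Collect simple statistics over an iterable of (x, y) NumPy batches.

    Hypothetical helper for illustration: it reports the range, mean and
    standard deviation of the features, whether any NaN was seen, and
    whether the labels look one-hot encoded.
    """
    n_values = 0
    total = 0.0
    total_sq = 0.0
    x_min, x_max = np.inf, -np.inf
    has_nan = False
    one_hot = True
    class_counts = None

    for x, y in batches:
        x = np.asarray(x, dtype=np.float64)
        y = np.asarray(y)

        # Feature statistics, accumulated batch by batch so that the whole
        # file never has to be held in memory.
        has_nan = has_nan or bool(np.isnan(x).any())
        n_values += x.size
        total += float(np.nansum(x))
        total_sq += float(np.nansum(x * x))
        x_min = min(x_min, float(np.nanmin(x)))
        x_max = max(x_max, float(np.nanmax(x)))

        # One-hot labels are 2-D, contain only 0/1, and each row sums to 1.
        if y.ndim == 2:
            rows_ok = np.isin(y, (0, 1)).all() and np.allclose(y.sum(axis=-1), 1)
            one_hot = one_hot and bool(rows_ok)
            counts = y.sum(axis=0)
            class_counts = counts if class_counts is None else class_counts + counts
        else:
            one_hot = False

    if n_values == 0:
        raise ValueError("no batches were provided")

    mean = total / n_values
    std = float(np.sqrt(max(total_sq / n_values - mean ** 2, 0.0)))
    return {
        "min": x_min,
        "max": x_max,
        "mean": mean,
        "std": std,
        "has_nan": has_nan,
        "labels_one_hot": one_hot,
        "class_counts": class_counts,
    }
```

With the pipeline from the question, a few batches could be passed in as summarize_batches(train.take(10).as_numpy_iterator()). Comparing the result with the same statistics for the smaller in-memory data that trained well should show whether the two inputs were prepared the same way.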
