ValueError: Unable to create dataset (name already exists) when saving my model with ModelCheckpoint



I am trying to run the official Keras code example "Image classification with Swin Transformers". The code worked fine at first, but after I added a ModelCheckpoint to the model's callbacks argument to save an hdf5 model (i.e. model.fit(..., callbacks=[ModelCheckpoint(...)], ...)), I get the following error: [ValueError: Unable to create dataset (name already exists)]. What does "name" refer to here, and how do I fix this?

I ran the code both on my local machine (Windows 10, TensorFlow 2.8.0) and on Google Colab (TensorFlow 2.8.2) and got the error above in both environments.

The full code example can be found here [https://keras]. The only difference between my code and the example is the single line I added for the ModelCheckpoint. The added line and the error message are shown below.

Code snippet:

model = keras.Model(input, output)
model.compile(
    loss=keras.losses.CategoricalCrossentropy(label_smoothing=label_smoothing),
    optimizer=tfa.optimizers.AdamW(
        learning_rate=learning_rate, weight_decay=weight_decay
    ),
    metrics=[
        keras.metrics.CategoricalAccuracy(name="accuracy"),
        keras.metrics.TopKCategoricalAccuracy(5, name="top-5-accuracy"),
    ],
)
history = model.fit(
    x_train,
    y_train,
    batch_size=batch_size,
    epochs=num_epochs,
    validation_split=validation_split,
    # 👇 I added one line of code
    callbacks = keras.callbacks.ModelCheckpoint('lowest_loss.hdf5', monitor='loss', verbose=0, save_best_only=True, save_weights_only=True)
)

Here is the error I get:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-c96b13609516> in <module>()
18     validation_split=validation_split,
19     # 👇 I added one line of code
---> 20     callbacks = keras.callbacks.ModelCheckpoint('lowest_loss.hdf5', monitor='loss', verbose=0, save_best_only=True, save_weights_only=True)
21 )
2 frames
/usr/local/lib/python3.7/dist-packages/keras/utils/traceback_utils.py in error_handler(*args, **kwargs)
65     except Exception as e:  # pylint: disable=broad-except
66       filtered_tb = _process_traceback_frames(e.__traceback__)
---> 67       raise e.with_traceback(filtered_tb) from None
68     finally:
69       del filtered_tb
/usr/local/lib/python3.7/dist-packages/h5py/_hl/group.py in create_dataset(self, name, shape, dtype, data, **kwds)
146                     group = self.require_group(parent_path)
147 
--> 148             dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
149             dset = dataset.Dataset(dsid)
150             return dset
/usr/local/lib/python3.7/dist-packages/h5py/_hl/dataset.py in make_new_dset(parent, shape, dtype, data, name, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times, external, track_order, dcpl, allow_unknown_filter)
135 
136 
--> 137     dset_id = h5d.create(parent.id, name, tid, sid, dcpl=dcpl)
138 
139     if (data is not None) and (not isinstance(data, Empty)):
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/h5d.pyx in h5py.h5d.create()
ValueError: Unable to create dataset (name already exists)

Most likely, this error occurs when a layer name is duplicated between the namespaces of the pretrained model and the downstream-task network, so two sets of weights map to the same dataset path in the HDF5 file. It can help to give every layer of the downstream-task network a unique name: add name='some_unique_name' to each layer you create to resolve the problem.
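As a minimal sketch of that suggestion (the toy model and layer names below are hypothetical, not taken from the Swin example), an explicit name= on every layer keeps each weight's dataset path in the .hdf5 file unique:

from tensorflow import keras

# Hypothetical downstream head: every layer gets an explicit, unique name,
# so its weights are written to distinct dataset paths in the HDF5 file.
inputs = keras.Input(shape=(32, 32, 3), name="downstream_input")
x = keras.layers.GlobalAveragePooling2D(name="downstream_pool")(inputs)
outputs = keras.layers.Dense(100, activation="softmax", name="downstream_head")(x)
model = keras.Model(inputs, outputs, name="downstream_model")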

Try this, it worked for me:

for i in range(len(model.weights)):
    model.weights[i]._handle_name = model.weights[i].name + "_" + str(i)
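This works because each weight is saved as an HDF5 dataset keyed by its name; appending the index makes every name unique, so h5py never tries to create the same dataset twice. Note that _handle_name is a private TensorFlow attribute, so this workaround may break in future versions; run the loop after the model is built and before the first checkpoint is written.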

I'm not sure this is exactly what you need, but here is code I wrote to save checkpoints and to load the latest one based on the saved .h5 file names, so training of a Keras model can resume without repeating epochs. I hope it helps you, since I ran into the same error, "ValueError: Unable to create dataset (name already exists)".

import os

import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.callbacks import ModelCheckpoint

# `gpt2_lm` is the model being trained and `dataset` the raw data,
# both defined elsewhere in the script.

# Prepare the dataset
train_ds = (
    tf.data.Dataset.from_tensor_slices(dataset)
    .batch(16)
    .cache()
    .prefetch(tf.data.AUTOTUNE)
)

# Define the number of epochs for demonstration
num_epochs = 3

# Directory to save the checkpoints
checkpoint_dir = "checkpoints/"
os.makedirs(checkpoint_dir, exist_ok=True)

# Callback to save checkpoints per epoch
checkpoint_callback = ModelCheckpoint(
    filepath=os.path.join(checkpoint_dir, "model_{epoch:02d}.h5"),
    save_freq="epoch",
    save_weights_only=True,
    save_best_only=False,
    verbose=1,
)

# Check if there are existing checkpoints to resume from the last one
checkpoint_files = [file for file in os.listdir(checkpoint_dir) if file.endswith(".h5")]
if checkpoint_files:
    latest_checkpoint = max(checkpoint_files)
    gpt2_lm.load_weights(os.path.join(checkpoint_dir, latest_checkpoint))
    # Get the epoch number from the checkpoint file name
    num_epoch_resume = int(latest_checkpoint.split("_")[1].split(".")[0])

    # Train the model, resuming from the saved epoch
    history = gpt2_lm.fit(
        train_ds,
        initial_epoch=num_epoch_resume,
        epochs=num_epochs,
        callbacks=[checkpoint_callback],
    )
else:
    # No existing checkpoints, train the model from scratch
    history = gpt2_lm.fit(train_ds, epochs=num_epochs, callbacks=[checkpoint_callback])
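One design note on the snippet: the zero-padded pattern model_{epoch:02d}.h5 is what lets the plain max() over file names pick the latest checkpoint lexicographically, and initial_epoch tells fit() to continue the epoch count from there instead of restarting at 0. (gpt2_lm and dataset come from my own script; substitute your model and data.)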

If you are using TensorFlow version 2.0 or higher, you can try changing the ".hdf5" file extension to ".tf". I ran into the same problem and changed the file extension as follows:

save_dir = os.path.join(os.getcwd(), "save_models")
filepath = "cnn_cnn_weights.{epoch:02d}-{val_loss:.4f}--0fold.tf"
checkpoint = ModelCheckpoint(
    os.path.join(save_dir, filepath),
    monitor="val_loss", verbose=1, save_best_only=False, mode='min'
)
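My understanding of why this helps: with a ".tf" path, ModelCheckpoint writes in TensorFlow's native SavedModel/checkpoint format instead of HDF5, and that format does not store weights as named datasets inside a single file, so the duplicate-dataset-name restriction never comes into play.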
