"No such object" error when saving a TensorFlow model trained on Google Cloud AI Platform to a Google Cloud Storage bucket



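I am training a model with TensorFlow on Google Cloud's AI Platform. Training itself runs fine, but I cannot save the finished model in SavedModel format to my Cloud Storage bucket. I know the bucket is set up correctly, because I download the training data from that same bucket at the start of training. This is the code I use to save the model: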

SAVE_PATH = os.path.join("gs://", 'machine-learning-ebay', 'job-dir')
linear_model.save(SAVE_PATH)

Here 'machine-learning-ebay' is the bucket and 'job-dir' is a folder in that bucket.

I get the following error on the job description page in Google Cloud:

Traceback (most recent call last):
[...]
File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/tracking/util.py", line 1219, in save
file_prefix_tensor, object_graph_tensor, options)
File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/tracking/util.py", line 1164, in _save_cached_when_graph_building
save_op = saver.save(file_prefix, options=options)
File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/saving/functional_saver.py", line 300, in save
return save_fn()
File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/saving/functional_saver.py", line 287, in save_fn
sharded_prefixes, file_prefix, delete_old_dirs=True)
File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 504, in merge_v2_checkpoints
delete_old_dirs=delete_old_dirs, name=name, ctx=_ctx)
File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 528, in merge_v2_checkpoints_eager_fallback
attrs=_attrs, ctx=ctx, name=name)
File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.NotFoundError: Error executing an HTTP request: HTTP response code 404 with body '{
"error": {
"code": 404,
"message": "No such object: machine-learning-ebay/job-dir/variables/variables_temp/part-00000-of-00001.data-00000-of-00001",
"errors": [
{
"message": "No such object: machine-learning-ebay/job-dir/variables/variables_temp/part-00000-of-00001.data-00000-of-00001",
"domain": "global",
"reason": "notFound"
}
]
}
}

Any help would be greatly appreciated; the deadline for this project is today.

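Following the code in Google's training sample (https://github.com/GoogleCloudPlatform/cloudml-samples/blob/main/census/tf-keras/trainer/task.py) and a GitHub issue suggesting that timestamping the output folder resolves the overwrite problem (https://github.com/kubeflow/pipelines/issues/2171), I changed my export code to the following: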

from datetime import datetime
current_time = datetime.now().strftime("%H.%M.%S")  # timestamp so each export gets a fresh folder
tf.compat.v1.keras.experimental.export_saved_model(linear_model, 'gs://machine-learning-ebay/job-dir/keras-export' + current_time)

This resolved the error I was hitting during training, and the model exported successfully.
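For anyone who wants to reuse the timestamping idea, here is a minimal sketch of a helper that builds a unique export folder name. The function name `timestamped_export_path`, the `keras-export` prefix default, and the choice to include the date in the timestamp (so that runs on different days cannot collide) are my own assumptions, not something taken from the Google sample or the linked issue.

```python
from datetime import datetime, timezone
from typing import Optional


def timestamped_export_path(base_uri: str,
                            prefix: str = "keras-export",
                            now: Optional[datetime] = None) -> str:
    """Return '<base_uri>/<prefix>-YYYYmmdd-HHMMSS' (hypothetical helper).

    Each call made at a different second yields a different folder,
    so an export never writes into a previous export's directory.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%d-%H%M%S")
    # Strip a trailing slash so the result has no doubled separator.
    return f"{base_uri.rstrip('/')}/{prefix}-{stamp}"


export_path = timestamped_export_path("gs://machine-learning-ebay/job-dir")
# Pass export_path to the export call shown above in place of the
# hand-concatenated string.
```

The `now` parameter only exists so the helper can be checked with a fixed time; in a training job you would normally leave it unset.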
