I am using the recommenders library (https://github.com/microsoft/recommenders) to train an NCF recommendation model. Currently I am having trouble deploying it through Amazon SageMaker's TensorFlowModel class. The model is saved using the following code:

def save(self, dir_name):
    """Save model parameters in `dir_name`

    Args:
        dir_name (str): directory name, which should be a folder name instead of file name
            we will create a new directory if not existing.
    """
    # save trained model
    if not os.path.exists(dir_name):
        os.makedirs(dir_name)
    saver = tf.compat.v1.train.Saver()
    saver.save(self.sess, os.path.join(dir_name, MODEL_CHECKPOINT))
The files exported by this process are 'checkpoint', 'model.ckpt.data-00000-of-00001', 'model.ckpt.index', and 'model.ckpt.meta'.
They are packaged in the following directory structure:
- model.tar.gz
  - 00000000
    - checkpoint
    - model.ckpt.data-00000-of-00001
    - model.ckpt.index
    - model.ckpt.meta
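I have tried several deployment approaches, but they all produce the same error. This is the latest one, which I implemented following this example: https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-script-mode/pytorch_bert/code/inference_code.py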
from sagemaker.tensorflow.model import TensorFlowModel

model = TensorFlowModel(
    entry_point="tf_inference.py",
    model_data=zipped_model_path,
    role=role,
    model_version='1',
    framework_version="2.7"
)
predictor = model.deploy(
    initial_instance_count=1, instance_type="ml.g4dn.2xlarge", endpoint_name='endpoint-name3'
)
Every approach ends with the same error:
Traceback (most recent call last):
  File "/sagemaker/serve.py", line 502, in <module>
    ServiceManager().start()
  File "/sagemaker/serve.py", line 482, in start
    self._create_tfs_config()
  File "/sagemaker/serve.py", line 153, in _create_tfs_config
    raise ValueError("no SavedModel bundles found!")
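Judging from the error message, the serving container is looking for SavedModel bundles, that is, a saved_model.pb file inside a numeric version directory, and a checkpoint-only archive contains none. Below is a minimal sketch of a local check under that assumption. The helper name find_saved_model_versions is hypothetical and not part of the SageMaker SDK, and the exact rule the container applies may differ.

```python
import tarfile


def find_saved_model_versions(tar_path):
    """Return archive directories that look like servable SavedModel versions.

    Assumption: a servable version is any `<digits>/saved_model.pb` entry,
    optionally nested under a model-name directory.
    """
    with tarfile.open(tar_path, "r:*") as tar:
        names = [member.name for member in tar.getmembers()]

    versions = []
    for name in names:
        parts = name.strip("/").split("/")
        # Need at least "<version>/saved_model.pb" with a numeric version directory.
        if len(parts) >= 2 and parts[-1] == "saved_model.pb" and parts[-2].isdigit():
            versions.append("/".join(parts[:-1]))
    return versions
```

For the archive described above, which holds only checkpoint files under 00000000, this returns an empty list.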
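These two links helped me resolve the issue:
- https://github.com/aws/sagemaker-python-sdk/issues/599
- https://www.tensorflow.org/guide/migrate/saved_model#1_save_the_graph_as_a_savedmodel_with_savedmodelbuilder
SageMaker expects a specific directory structure inside model.tar.gz, and you need to follow it strictly. The checkpoint files written by tf.compat.v1.train.Saver are not a SavedModel, which is why the container reports that no SavedModel bundles were found. The first link describes the directory layout to start from, and the second shows how to save a model as a SavedModel in both TF1 and TF2.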
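As a rough sketch of what that fix can look like: export the TF1 session with SavedModelBuilder, then pack the result under a numeric version directory. The helper names, the model name "ncf", and the `<model_name>/<version>/` layout are my assumptions for illustration. Check the linked issue for the layout your container version expects.

```python
import os
import tarfile


def export_saved_model(sess, export_dir, inputs, outputs):
    """Write the graph held by a TF1-style session as a SavedModel.

    `inputs` and `outputs` map signature names to tensors.
    `export_dir` must not exist yet; the builder creates it.
    """
    import tensorflow as tf  # imported lazily so packaging works without TF installed

    tf1 = tf.compat.v1
    builder = tf1.saved_model.Builder(export_dir)
    signature = tf1.saved_model.predict_signature_def(inputs=inputs, outputs=outputs)
    builder.add_meta_graph_and_variables(
        sess,
        tags=[tf1.saved_model.tag_constants.SERVING],
        signature_def_map={
            tf1.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: signature
        },
    )
    builder.save()


def package_for_sagemaker(saved_model_dir, tar_path, model_name="ncf", version="1"):
    """Pack a SavedModel directory as <model_name>/<version>/... in a .tar.gz.

    The layout is an assumption; adjust it to what your serving container expects.
    """
    if not os.path.isfile(os.path.join(saved_model_dir, "saved_model.pb")):
        raise FileNotFoundError("no saved_model.pb in " + saved_model_dir)
    with tarfile.open(tar_path, "w:gz") as tar:
        # arcname rewrites the top-level directory name inside the archive.
        tar.add(saved_model_dir, arcname=f"{model_name}/{version}")
    return tar_path
```

With the NCF model this might be called as `export_saved_model(model.sess, "export", {"user": model.user_input, "item": model.item_input}, {"score": model.output})` followed by `package_for_sagemaker("export", "model.tar.gz")`. The attribute names `user_input`, `item_input`, and `output` are assumptions about the NCF class, so verify them against the library source before relying on them.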