Vertex AI - Deployment failed

I am trying to deploy my custom-trained model using a custom container, i.e., create an endpoint from a model I created. I am doing the exact same thing with AI Platform (same model & container) and it works fine there.

On my first attempt I deployed the model successfully, but ever since, whenever I try to create an endpoint, it shows "Deploying" for 1+ hour and then fails with the following error:

google.api_core.exceptions.FailedPrecondition: 400 Error: model server never became ready. Please validate that your model file or container configuration are valid. Model server logs can be found at (link)

The logs show the following:

* Running on all addresses (0.0.0.0)
WARNING: This is a development server. Do not use it in a production deployment.
* Running on http://127.0.0.1:8080
[05/Jul/2022 12:00:37] "GET /v1/endpoints/1/deployedModels/2025850174177280000 HTTP/1.1" 404 -
[05/Jul/2022 12:00:38] "GET /v1/endpoints/1/deployedModels/2025850174177280000 HTTP/1.1" 404 -

The last line is repeated over and over until the deployment finally fails.
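For what it's worth, the 404s mean the Flask server has no route matching the path being probed; the path has the shape of Vertex AI's default health route (/v1/endpoints/ENDPOINT/deployedModels/DEPLOYED_MODEL), which is what gets probed when no custom health_route is in effect. A small, hypothetical handler that could be added to the app below to make the probed path explicit in the model server logs:

@app.errorhandler(404)
def log_unmatched_route(error):
    # Surface exactly which path the platform is probing, instead of
    # a bare 404 access-log line.
    app.logger.warning("Unmatched path probed: %s", request.path)
    return "not found", 404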

My Flask app is as follows:

import base64
import os.path
import pickle
from typing import Dict, Any

from flask import Flask, request, jsonify
from streamliner.models.general_model import GeneralModel


class Predictor:
    def __init__(self, model: GeneralModel):
        self._model = model

    def predict(self, instance: str) -> Dict[str, Any]:
        decoded_pickle = base64.b64decode(instance)
        features_df = pickle.loads(decoded_pickle)
        prediction = self._model.predict(features_df).tolist()
        return {"prediction": prediction}


app = Flask(__name__)

with open('./model.pkl', 'rb') as model_file:
    model = pickle.load(model_file)
predictor = Predictor(model=model)


@app.route("/predict", methods=['POST'])
def predict() -> Any:
    if request.method == "POST":
        instance = request.get_json()
        instance = instance['instances'][0]
        predictions = predictor.predict(instance)
        return jsonify(predictions)


@app.route("/health")
def health() -> str:
    return "ok"


if __name__ == '__main__':
    port = int(os.environ.get("PORT", 8080))
    app.run(host='0.0.0.0', port=port)
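One thing worth noting about the hard-coded routes: Vertex AI's custom container contract passes the serving configuration to the container through environment variables (AIP_HTTP_PORT, AIP_HEALTH_ROUTE, AIP_PREDICT_ROUTE). A minimal sketch of binding the same handlers to whatever routes the platform injects, replacing the two @app.route decorators above (everything else unchanged):

import os

# Fall back to the hard-coded paths when running outside Vertex AI.
health_route = os.environ.get("AIP_HEALTH_ROUTE", "/health")
predict_route = os.environ.get("AIP_PREDICT_ROUTE", "/predict")
app.add_url_rule(health_route, "health", health)
app.add_url_rule(predict_route, "predict", predict, methods=["POST"])

Whether this fixes the failure depends on what Vertex AI is actually probing, but it removes one possible mismatch between the container and the platform.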

The deployment code I run through Python is not relevant, because the problem persists when I deploy through GCP's UI.

The model creation code is as follows:

def upload_model(self):
    model = {
        "name": self.model_name_on_platform,
        "display_name": self.model_name_on_platform,
        "version_aliases": ["default", self.run_id],
        "container_spec": {
            "image_uri": f'{REGION}-docker.pkg.dev/{GCP_PROJECT_ID}/{self.repository_name}/{self.run_id}',
            "predict_route": "/predict",
            "health_route": "/health",
        },
    }
    parent = self.model_service_client.common_location_path(project=GCP_PROJECT_ID, location=REGION)
    model_path = self.model_service_client.model_path(project=GCP_PROJECT_ID,
                                                      location=REGION,
                                                      model=self.model_name_on_platform)
    upload_model_request_specifications = {'parent': parent, 'model': model,
                                           'model_id': self.model_name_on_platform}
    try:
        print("trying to get model")
        self.get_model(model_path=model_path)
    except NotFound:
        print("didn't find model, creating a new one")
    else:
        print("found an existing model, creating a new version under it")
        upload_model_request_specifications['parent_model'] = model_path
    upload_model_request = model_service.UploadModelRequest(upload_model_request_specifications)
    response = self.model_service_client.upload_model(request=upload_model_request, timeout=1800)
    print("Long running operation:", response.operation.name)
    upload_model_response = response.result(timeout=1800)
    print("upload_model_response:", upload_model_response)

My question is very similar to this one, with the difference that I do have a health check.

Why does it work on the first deployment and then fail? Why does it work on AI Platform but fail on Vertex AI?

This issue can be caused by several different things:

  1. Validate the container configuration port; it should use port 8080. This configuration is important because Vertex AI sends liveness checks, health checks, and prediction requests to this port on the container. You can see this documentation about containers, and this other one about custom containers; there is also a sketch after this list.

  2. Another possible cause is a quota limit, which might need to be increased. You can use this documentation to verify that.

  3. Use the MODEL_NAME you are using in the health and predict routes, as in the following example:

"predict_route": "/v1/models/MODEL_NAME:predict",
"health_route": "/v1/models/MODEL_NAME",

  4. Verify that the account you are using has enough permissions to read your project's GCS bucket.

  5. Verify the model location; it should be the correct path.
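On the port point in suggestion 1, a hedged sketch of making the serving port explicit in the container_spec from the question ("ports" is part of Vertex AI's ModelContainerSpec; everything else is unchanged):

"container_spec": {
    "image_uri": f'{REGION}-docker.pkg.dev/{GCP_PROJECT_ID}/{self.repository_name}/{self.run_id}',
    "predict_route": "/predict",
    "health_route": "/health",
    # Make the serving port explicit; Vertex AI defaults to 8080 when unset.
    "ports": [{"container_port": 8080}],
},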

If none of the above suggestions helps, you will need to contact GCP support by creating a support case so they can fix it. It is not possible for the community to troubleshoot this further without access to internal GCP resources.

If you haven't found a solution yet, you could try custom prediction routines. They are really helpful because they remove the need to write the server part of the code and let us focus on the logic of our ML model and any kind of pre- or post-processing. Here is a link to help you: https://codelabs.developers.google.com/vertex-cpr-sklearn#0. Hope it helps.
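A minimal sketch of what that looks like, assuming the Predictor interface shown in that codelab (the class name is illustrative, and model.pkl matches the artifact from the question):

import pickle

import numpy as np
from google.cloud.aiplatform.prediction.predictor import Predictor
from google.cloud.aiplatform.utils import prediction_utils


class SketchPredictor(Predictor):
    """Only the model logic; Vertex AI supplies the HTTP server and routes."""

    def load(self, artifacts_uri: str) -> None:
        # Copies the model artifacts (here model.pkl) from GCS to the local dir.
        prediction_utils.download_model_artifacts(artifacts_uri)
        with open("model.pkl", "rb") as f:
            self._model = pickle.load(f)

    def preprocess(self, prediction_input: dict) -> np.ndarray:
        # The request body arrives as {"instances": [...]}.
        return np.asarray(prediction_input["instances"])

    def predict(self, instances: np.ndarray) -> np.ndarray:
        return self._model.predict(instances)

    def postprocess(self, prediction_results: np.ndarray) -> dict:
        return {"predictions": prediction_results.tolist()}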
