Cannot deploy a trained model to Google Cloud AI Platform using a custom prediction routine: the model requires more memory than allowed



I am trying to deploy a pretrained PyTorch model to AI Platform using a custom prediction routine. After following the instructions described here, the deployment fails with the following error:

ERROR: (gcloud.beta.ai-platform.versions.create) Create Version failed. Bad model detected with error: Model requires more memory than allowed. Please try to decrease the model size and re-deploy. If you continue to have error, please contact Cloud ML.

The contents of the model folder are 83.89 MB, well below the 250 MB limit described in the documentation. The only files in the folder are the model's checkpoint file (.pth) and the tarball required by the custom prediction routine.

Command used to create the model version:

gcloud beta ai-platform versions create pose_pytorch --model pose --runtime-version 1.15 --python-version 3.5 --origin gs://rcg-models/pytorch_pose_estimation --package-uris gs://rcg-models/pytorch_pose_estimation/my_custom_code-0.1.tar.gz --prediction-class predictor.MyPredictor

Changing the runtime version to 1.14 leads to the same error. I also tried changing the --machine-type parameter to mls1-c4-m2, as Parth suggested, but I still get the same error.

The setup.py file that produces my_custom_code-0.1.tar.gz looks like this:

from setuptools import setup

setup(
    name='my_custom_code',
    version='0.1',
    scripts=['predictor.py'],
    install_requires=["opencv-python", "torch"]
)

The relevant code snippet from the predictor:

import os

import torch
from google.cloud import storage

# PoseEstimationWithMobileNet and load_state come from the pose-estimation
# project code packaged with the predictor (module paths assumed here).
from models.with_mobilenet import PoseEstimationWithMobileNet
from modules.load_state import load_state


class MyPredictor(object):
    def __init__(self, model):
        """Stores artifacts for prediction. Only initialized via `from_path`."""
        self._model = model
        self._client = storage.Client()

    @classmethod
    def from_path(cls, model_dir):
        """Creates an instance of MyPredictor using the given path.

        This loads artifacts that have been copied from your model directory in
        Cloud Storage. MyPredictor uses them during prediction.

        Args:
            model_dir: The local directory that contains the trained Keras
                model and the pickled preprocessor instance. These are copied
                from the Cloud Storage model directory you provide when you
                deploy a version resource.

        Returns:
            An instance of `MyPredictor`.
        """
        net = PoseEstimationWithMobileNet()
        checkpoint_path = os.path.join(model_dir, "checkpoint_iter_370000.pth")
        checkpoint = torch.load(checkpoint_path, map_location='cpu')
        load_state(net, checkpoint)
        return cls(net)
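
For context, the snippet above only shows the loading half of the routine; AI Platform also calls a predict(instances, **kwargs) method on the prediction class, which is not shown in the question. A minimal, hypothetical sketch of that method, with placeholder input/output handling rather than the real pose-estimation pre/post-processing:

import numpy as np
import torch


class MyPredictor(object):
    # ... __init__ and from_path as shown above ...

    def predict(self, instances, **kwargs):
        """Runs the loaded network on a list of JSON-serializable instances.

        Placeholder sketch only: assumes each instance is a nested list already
        shaped (channels, height, width), which is not the real preprocessing.
        """
        results = []
        for instance in instances:
            tensor = torch.from_numpy(
                np.asarray(instance, dtype=np.float32)
            ).unsqueeze(0)
            with torch.no_grad():
                output = self._model(tensor)
            # Convert the output tensors back to plain lists so the response
            # stays JSON-serializable.
            results.append([o.squeeze(0).tolist() for o in output])
        return results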

I also enabled logging for the model in AI Platform and got the following output:

2019-12-17T09:28:06.208537Z OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k 
2019-12-17T09:28:13.474653Z WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/google/cloud/ml/prediction/frameworks/tf_prediction_lib.py:48: The name tf.saved_model.tag_constants.SERVING is deprecated. Please use tf.saved_model.SERVING instead. 
2019-12-17T09:28:13.474680Z {"textPayload":"","insertId":"5df89fad00073e383ced472a","resource":{"type":"cloudml_model_version","labels":{"project_id":"rcg-shopper","region":"","version_id":"lightweight_pose_pytorch","model_id":"pose"}},"timestamp":"2019-12-17T09:28:13.474680Z","logName":"projects/rcg-shopper/logs/ml.googleapis… 
2019-12-17T09:28:13.474807Z WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/google/cloud/ml/prediction/frameworks/tf_prediction_lib.py:50: The name tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY is deprecated. Please use tf.saved_model.DEFAULT_SERVING_SIGNATURE_DEF_KEY instead. 
2019-12-17T09:28:13.474829Z {"textPayload":"","insertId":"5df89fad00073ecd4836d6aa","resource":{"type":"cloudml_model_version","labels":{"project_id":"rcg-shopper","region":"","version_id":"lightweight_pose_pytorch","model_id":"pose"}},"timestamp":"2019-12-17T09:28:13.474829Z","logName":"projects/rcg-shopper/logs/ml.googleapis… 
2019-12-17T09:28:13.474918Z WARNING:tensorflow: 
2019-12-17T09:28:13.474927Z The TensorFlow contrib module will not be included in TensorFlow 2.0. 
2019-12-17T09:28:13.474934Z For more information, please see: 
2019-12-17T09:28:13.474941Z   * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md 
2019-12-17T09:28:13.474951Z   * https://github.com/tensorflow/addons 
2019-12-17T09:28:13.474958Z   * https://github.com/tensorflow/io (for I/O related ops) 
2019-12-17T09:28:13.474964Z If you depend on functionality not listed there, please file an issue. 
2019-12-17T09:28:13.474999Z {"textPayload":"","insertId":"5df89fad00073f778735d7c3","resource":{"type":"cloudml_model_version","labels":{"version_id":"lightweight_pose_pytorch","model_id":"pose","project_id":"rcg-shopper","region":""}},"timestamp":"2019-12-17T09:28:13.474999Z","logName":"projects/rcg-shopper/logs/ml.googleapis… 
2019-12-17T09:28:15.283483Z ERROR:root:Failed to import GA GRPC module. This is OK if the runtime version is 1.x 
2019-12-17T09:28:16.890923Z Copying gs://cml-489210249453-1560169483791188/models/pose/lightweight_pose_pytorch/15316451609316207868/user_code/my_custom_code-0.1.tar.gz... 
2019-12-17T09:28:16.891150Z / [0 files][    0.0 B/  8.4 KiB]                                                 
2019-12-17T09:28:17.007684Z / [1 files][  8.4 KiB/  8.4 KiB]                                                 
2019-12-17T09:28:17.009154Z Operation completed over 1 objects/8.4 KiB.                                       
2019-12-17T09:28:18.953923Z Processing /tmp/custom_code/my_custom_code-0.1.tar.gz 
2019-12-17T09:28:19.808897Z Collecting opencv-python 
2019-12-17T09:28:19.868579Z   Downloading https://files.pythonhosted.org/packages/d8/38/60de02a4c9013b14478a3f681a62e003c7489d207160a4d7df8705a682e7/opencv_python-4.1.2.30-cp37-cp37m-manylinux1_x86_64.whl (28.3MB) 
2019-12-17T09:28:21.537989Z Collecting torch 
2019-12-17T09:28:21.552871Z   Downloading https://files.pythonhosted.org/packages/f9/34/2107f342d4493b7107a600ee16005b2870b5a0a5a165bdf5c5e7168a16a6/torch-1.3.1-cp37-cp37m-manylinux1_x86_64.whl (734.6MB) 
2019-12-17T09:28:52.401619Z Collecting numpy>=1.14.5 
2019-12-17T09:28:52.412714Z   Downloading https://files.pythonhosted.org/packages/9b/af/4fc72f9d38e43b092e91e5b8cb9956d25b2e3ff8c75aed95df5569e4734e/numpy-1.17.4-cp37-cp37m-manylinux1_x86_64.whl (20.0MB) 
2019-12-17T09:28:53.550662Z Building wheels for collected packages: my-custom-code 
2019-12-17T09:28:53.550689Z   Building wheel for my-custom-code (setup.py): started 
2019-12-17T09:28:54.212558Z   Building wheel for my-custom-code (setup.py): finished with status 'done' 
2019-12-17T09:28:54.215365Z   Created wheel for my-custom-code: filename=my_custom_code-0.1-cp37-none-any.whl size=7791 sha256=fd9ecd472a6a24335fd24abe930a4e7d909e04bdc4cf770989143d92e7023f77 
2019-12-17T09:28:54.215482Z   Stored in directory: /tmp/pip-ephem-wheel-cache-i7sb0bmb/wheels/0d/6e/ba/bbee16521304fc5b017fa014665b9cae28da7943275a3e4b89 
2019-12-17T09:28:54.222017Z Successfully built my-custom-code 
2019-12-17T09:28:54.650218Z Installing collected packages: numpy, opencv-python, torch, my-custom-code 

This is a common problem and we understand it is a pain point. Please do the following:

  1. torchvision has torch as a dependency, and by default it pulls torch from PyPI.

When you deploy the model, this happens even if you point it at a custom AI Platform torchvision package, because when torchvision is built by the PyTorch team it is configured to use torch as a dependency. That torch dependency from PyPI is a ~720 MB file, because it includes the GPU units.

  2. To solve #1, you need to build torchvision from source and tell torchvision where to get torch from; you should point it at the PyTorch download site, since the package there is smaller. Rebuild the torchvision binary using the Python PEP-0440 direct references feature. In torchvision's setup.py we have:
pytorch_dep = 'torch'
if os.getenv('PYTORCH_VERSION'):
    pytorch_dep += "==" + os.getenv('PYTORCH_VERSION')

Update setup.py in torchvision to use the direct reference feature:

requirements = [
    # 'numpy',
    # 'six',
    # pytorch_dep,
    'torch @ https://download.pytorch.org/whl/cpu/torch-1.4.0%2Bcpu-cp37-cp37m-linux_x86_64.whl'
]

*I have already done this for you*, and built 3 wheel files that you can use:

gs://dpe-sandbox/torchvision-0.4.0-cp37-cp37m-linux_x86_64.whl (torch 1.2.0, vision 0.4.0)
gs://dpe-sandbox/torchvision-0.4.2-cp37-cp37m-linux_x86_64.whl (torch 1.2.0, vision 0.4.2)
gs://dpe-sandbox/torchvision-0.5.0-cp37-cp37m-linux_x86_64.whl (torch 1.4.0  vision 0.5.0)

These torchvision packages will get torch from the PyTorch site instead of PyPI (for example: https://download.pytorch.org/whl/cpu/torch-1.4.0%2Bcpu-cp37-cp37m-linux_x86_64.whl).

  3. When deploying the model to AI Platform, update your model's setup.py so that it does not include torch or torchvision (a minimal sketch follows at the end of this answer).

  4. Re-deploy the model as follows:

PYTORCH_VISION_PACKAGE=gs://dpe-sandbox/torchvision-0.5.0-cp37-cp37m-linux_x86_64.whl

gcloud beta ai-platform versions create {MODEL_VERSION} --model={MODEL_NAME} \
  --origin=gs://{BUCKET}/{GCS_MODEL_DIR} \
  --python-version=3.7 \
  --runtime-version={RUNTIME_VERSION} \
  --machine-type=mls1-c4-m4 \
  --package-uris=gs://{BUCKET}/{GCS_PACKAGE_URI},{PYTORCH_VISION_PACKAGE} \
  --prediction-class={MODEL_CLASS}

You can change PYTORCH_VISION_PACKAGE to any of the options I mentioned in step 2.
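
For step 3, a minimal sketch (based on the setup.py shown in the question) of what the model package's setup.py could look like once torch and torchvision are removed from install_requires:

from setuptools import setup

# Sketch for step 3: torch and torchvision are deliberately left out of
# install_requires, since they are supplied by the prebuilt torchvision
# wheel passed via --package-uris instead.
setup(
    name='my_custom_code',
    version='0.1',
    scripts=['predictor.py'],
    install_requires=['opencv-python'],
)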

I was able to make this work by tweaking setup.py. Basically, install_requires tries to fetch the PyPI-hosted torch package, which is a huge GPU-build wheel and goes over the deployment quota. The setup.py below instead injects an install command that fetches a CPU-build torch from the official PyTorch index.

from setuptools import setup, find_packages
from setuptools.command.install import install as _install

INSTALL_REQUIRES = ['pillow']
CUSTOM_INSTALL_COMMANDS = [
    # Install torch here.
    [
        'python-default', '-m', 'pip', 'install', '--target=/tmp/custom_lib',
        '-b', '/tmp/pip_builds', 'torch==1.4.0+cpu', 'torchvision==0.5.0+cpu',
        '-f', 'https://download.pytorch.org/whl/torch_stable.html'
    ],
]


class Install(_install):
    def run(self):
        import sys
        if sys.platform == 'linux':
            import subprocess
            import logging
            for command in CUSTOM_INSTALL_COMMANDS:
                logging.info('Custom command: ' + ' '.join(command))
                result = subprocess.run(
                    command, check=True, stdout=subprocess.PIPE
                )
                logging.info(result.stdout.decode('utf-8', 'ignore'))
        _install.run(self)


setup(
    name='predictor',
    version='0.1',
    packages=find_packages(),
    install_requires=INSTALL_REQUIRES,
    cmdclass={'install': Install},
)
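
One thing to keep in mind with this approach: because pip installs the packages into /tmp/custom_lib via --target rather than into the default site-packages, the predictor may need to make that directory importable before torch is imported. Whether this step is actually required depends on how the serving container sets up its import path; a hedged sketch of what the top of predictor.py could look like:

import sys

# Assumption: packages installed with --target=/tmp/custom_lib are not on the
# default import path, so prepend the directory before importing torch.
CUSTOM_LIB_DIR = '/tmp/custom_lib'
if CUSTOM_LIB_DIR not in sys.path:
    sys.path.insert(0, CUSTOM_LIB_DIR)

import torch  # noqa: E402  (import intentionally placed after the path tweak)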

After many hours of trial and error, I came to the same conclusion as @kyamagu: "install_requires tries to fetch the PyPI-hosted torch package, which is a huge GPU-build wheel and goes over the deployment quota."

However, his solution did not work for me. So, after a few more hours of trial and error (caused by missing and incorrect documentation), I came up with this solution:

We need to get the CPU-build wheels of PyTorch, which are around 100 MB, instead of the 700 MB GPU-build wheels hosted by default. You can find them here: https://download.pytorch.org/whl/cpu/torch_stable.html

Next, we need to put them in our GCS bucket and then pass the paths as part of --package-uris, like this:

gcloud beta ai-platform versions create v17 \
  --model=newest \
  --origin=gs://bucket \
  --runtime-version=1.15 \
  --python-version=3.7 \
  --package-uris=gs://bucket/predictor-0.1.tar.gz,gs://bucket/torch-1.3.0+cpu-cp37-cp37m-linux_x86_64.whl \
  --prediction-class=predictor.MyPredictor \
  --machine-type=mls1-c4-m4

Also, note the order of the package-uris: the predictor package should come first, and there should not be any spaces after the commas.

Hope this helps. Cheers!
