Amazon SageMaker script mode: long Python wheel build times for CUDA components

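I use the PyTorch estimator with SageMaker to train/fine-tune my graph neural network on multi-GPU machines.

The requirements.txt that gets installed into the Estimator container contains the following lines: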

torch-scatter -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
torch-sparse -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
torch-cluster -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
torch-spline-conv -f https://data.pyg.org/whl/torch-1.10.0+cu113.html

When SageMaker installs these requirements in the Estimator on the endpoint, building the wheels takes ~2 hours. On a local Linux box it takes only seconds.

SageMaker Estimator:

PyTorch v1.10
CUDA 11.x
Python 3.8
Instance: ml.p3.16xlarge

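I have noticed the same problem with other wheel-based components that require CUDA.

I also tried building a Docker container on a p3.16xlarge and running it on SageMaker, but it could not recognize the instance GPUs.

Is there anything I can do to cut down these build times?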

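Installing these packages with pip requires them to be [compiled][1], which takes time. I'm not sure, but on your local machine they may have been built once, on the first install. One workaround is to extend the base [container][2] with the lines below (a one-time cost) and use the resulting image in the SageMaker Estimator:

ADD ./requirements.txt /tmp/packages/
RUN python -m pip install --no-cache-dir -r /tmp/packages/requirements.txt

[1]: https://github.com/rusty1s/pytorch_scatter/blob/master/setup.py
[2]: https://github.com/aws/deep-learning-containers/blob/master/pytorch/training/docker/1.10/py3/cu113/Dockerfile.sagemaker.gpu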
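
Whether pip compiles from source or downloads a prebuilt wheel depends on whether the index passed with -f offers a wheel for the torch and CUDA build that is actually in the container. I have not verified that a mismatch is what happened here, but it is cheap to check. A minimal sketch, assuming the data.pyg.org index URLs follow the pattern seen in the question's requirements.txt; the helper names (`pyg_wheel_index`, `requirement_lines`, `detect_torch_build`) are hypothetical:

```python
PACKAGES = ["torch-scatter", "torch-sparse", "torch-cluster", "torch-spline-conv"]


def pyg_wheel_index(torch_version: str, cuda_tag: str) -> str:
    """Build the find-links URL; the pattern is assumed from the question's URLs."""
    return f"https://data.pyg.org/whl/torch-{torch_version}+{cuda_tag}.html"


def requirement_lines(packages, torch_version, cuda_tag):
    """Return requirements.txt lines that point every package at one wheel index."""
    index = pyg_wheel_index(torch_version, cuda_tag)
    return [f"{pkg} -f {index}" for pkg in packages]


def detect_torch_build():
    """Return (version, cuda_tag) for the installed torch, or None if torch is absent."""
    try:
        import torch
    except ImportError:
        return None
    version = torch.__version__.split("+")[0]   # e.g. "1.10.0+cu113" -> "1.10.0"
    cuda = torch.version.cuda                   # e.g. "11.3", or None for CPU builds
    cuda_tag = ("cu" + cuda.replace(".", "")) if cuda else "cpu"
    return version, cuda_tag


if __name__ == "__main__":
    # Fall back to the versions from the question when torch is not installed.
    version, cuda_tag = detect_torch_build() or ("1.10.0", "cu113")
    print("\n".join(requirement_lines(PACKAGES, version, cuda_tag)))
```

Running this inside the training container and comparing the output with the requirements.txt shows whether the two agree on the torch/CUDA pair.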

The solution is to augment the stock estimator image with the right components; the result can then be run in SageMaker script mode:

FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.10-gpu-py38
COPY requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt
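
Once the extended image has been built and pushed to a registry SageMaker can pull from, it can be handed to the PyTorch estimator through `image_uri`. A sketch, assuming the SageMaker Python SDK v2; the image URI, role ARN, and `train.py` entry point are placeholders, and `estimator_kwargs` / `build_estimator` are hypothetical helper names:

```python
def estimator_kwargs(image_uri, role, entry_point="train.py",
                     instance_type="ml.p3.16xlarge", instance_count=1):
    """Collect the arguments for a script-mode job that runs on a custom image."""
    return {
        "image_uri": image_uri,          # the extended image, not the stock one
        "role": role,                    # IAM role the training job assumes
        "entry_point": entry_point,      # placeholder training script name
        "instance_type": instance_type,  # instance type from the question
        "instance_count": instance_count,
    }


def build_estimator(**kwargs):
    """Create the estimator; needs the sagemaker package and AWS credentials."""
    from sagemaker.pytorch import PyTorch  # imported lazily so the helper above works without it
    return PyTorch(**kwargs)


if __name__ == "__main__":
    kwargs = estimator_kwargs(
        image_uri="<account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>",  # placeholder
        role="<execution-role-arn>",                                        # placeholder
    )
    print(kwargs)
    # estimator = build_estimator(**kwargs)
    # estimator.fit()  # starts the training job
```

With the dependencies baked into the image, requirements.txt no longer needs to list the CUDA packages, so nothing is compiled when the job starts.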

The key is to make sure the nvidia runtime is used at build time, so daemon.json needs to be configured accordingly:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
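
If the build host already has a daemon.json with other settings, the nvidia entries can be merged in rather than overwriting the file. A minimal sketch, assuming the file lives at /etc/docker/daemon.json (the usual location on Linux, but check your host), that the script runs with write access to it, and that Docker is restarted afterwards; `with_nvidia_default` and `update_daemon_json` are hypothetical helper names:

```python
import json
from pathlib import Path


def with_nvidia_default(config: dict) -> dict:
    """Return a copy of a daemon.json dict with nvidia set as the default runtime."""
    merged = dict(config)
    runtimes = dict(merged.get("runtimes", {}))  # keep any other registered runtimes
    runtimes["nvidia"] = {"path": "nvidia-container-runtime", "runtimeArgs": []}
    merged["runtimes"] = runtimes
    merged["default-runtime"] = "nvidia"
    return merged


def update_daemon_json(path="/etc/docker/daemon.json"):
    """Merge the nvidia runtime into the file at `path`, creating it if missing."""
    p = Path(path)
    text = p.read_text().strip() if p.exists() else ""
    current = json.loads(text) if text else {}
    p.write_text(json.dumps(with_nvidia_default(current), indent=4) + "\n")


if __name__ == "__main__":
    # Preview the merged config without touching the real file.
    print(json.dumps(with_nvidia_default({"log-driver": "json-file"}), indent=4))
```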

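This is still not a complete solution, because whether the resulting image works on SageMaker depends on the host where the build is performed.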
