PyTorch Lightning with Amazon SageMaker

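We currently train with PyTorch Lightning outside of SageMaker. We would like to use SageMaker to take advantage of distributed training, checkpointing, and model training optimizations (Training Compiler) to speed up training and reduce cost. What is the recommended way to migrate a PyTorch Lightning script so that it runs on SageMaker?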

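Running PyTorch Lightning on SageMaker is not much different from running a plain PyTorch script.

However, when you run a distributed training job with DDPPlugin, one thing to watch for is setting the NODE_RANK environment variable correctly at the start of the script, because PyTorch Lightning knows nothing about SageMaker's environment variables and relies on generic cluster variables: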

import os
os.environ["NODE_RANK"] = str(int(os.environ.get("CURRENT_HOST", "algo-1")[5:]) - 1)

Or (more robust):

import json
import os
rc = json.loads(os.environ.get("SM_RESOURCE_CONFIG", "{}"))
os.environ["NODE_RANK"] = str(rc["hosts"].index(rc["current_host"]))
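The two variants above can be combined into one helper that prefers the resource config and falls back to parsing the host name. This is only a sketch: the function name `sagemaker_node_rank` is made up for illustration, and it assumes host names follow the `algo-N` pattern used in the one-liner above.

```python
import json
import os


def sagemaker_node_rank(environ=None):
    """Return this host's zero-based rank within the SageMaker training cluster."""
    environ = os.environ if environ is None else environ

    raw = environ.get("SM_RESOURCE_CONFIG")
    if raw:
        rc = json.loads(raw)
        # The position of the current host in the host list is its rank.
        return rc["hosts"].index(rc["current_host"])

    # Fallback: host names are assumed to look like "algo-1", "algo-2", ...
    host = environ.get("SM_CURRENT_HOST") or environ.get("CURRENT_HOST", "algo-1")
    return int(host.split("-")[-1]) - 1
```

At the top of the training script this would be used as `os.environ["NODE_RANK"] = str(sagemaker_node_rank())`, before the Trainer is created.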

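Since your question is specifically about migrating code that already works to SageMaker, and using the link here as a reference, I can try to break the process down into 3 parts:

Create a PyTorch estimator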

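import sagemaker
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
pytorch_estimator = PyTorch(
    entry_point='my_model.py',
    # IAM role for the training job; assumes this runs where an execution role is available
    role=sagemaker.get_execution_role(),
    instance_type='ml.g4dn.16xlarge',
    instance_count=1,
    framework_version='1.7',
    py_version='py3',
    output_path='<< s3 bucket >>',  # placeholder: replace with your S3 URI
    source_dir='<< path for my_model.py >>',  # placeholder: directory containing my_model.py
    sagemaker_session=sagemaker_session)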

entry_point = "my_model.py": this should be your existing PyTorch Lightning script. In its main section you can have something like the following:

if __name__ == '__main__':
    import pytorch_lightning as pl
    trainer = pl.Trainer(
        devices=-1,  # use all available GPUs
        accelerator="gpu",
        strategy="ddp",
        enable_checkpointing=True,
        default_root_dir="/opt/ml/checkpoints",
    )
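If the job is later scaled beyond one instance, the Trainer also needs to know how many nodes take part. A minimal sketch of deriving the Trainer arguments from the `SM_HOSTS` and `SM_NUM_GPUS` environment variables that SageMaker training containers set is shown below; the helper name `trainer_kwargs_from_sagemaker` and the CPU fallback are assumptions added for illustration, not part of the original answer.

```python
import json
import os


def trainer_kwargs_from_sagemaker(environ=None):
    """Build pl.Trainer keyword arguments from SageMaker environment variables."""
    environ = os.environ if environ is None else environ

    # SM_HOSTS is a JSON list of host names, e.g. '["algo-1", "algo-2"]'.
    hosts = json.loads(environ.get("SM_HOSTS", '["algo-1"]'))
    # SM_NUM_GPUS is the number of GPUs on the current instance.
    num_gpus = int(environ.get("SM_NUM_GPUS", "0"))

    kwargs = {
        "num_nodes": len(hosts),
        "enable_checkpointing": True,
        "default_root_dir": "/opt/ml/checkpoints",
    }
    if num_gpus > 0:
        kwargs.update(accelerator="gpu", devices=num_gpus, strategy="ddp")
    else:
        # Assumed fallback so the same script also runs on CPU instances.
        kwargs.update(accelerator="cpu", devices=1)
    return kwargs
```

In the entry script this could be used as `trainer = pl.Trainer(**trainer_kwargs_from_sagemaker())`.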

Launch the training job with the estimator created above:

pytorch_estimator.fit()
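Because the Trainer writes to /opt/ml/checkpoints, those files only outlive the job if the estimator is told where to sync them. A rough sketch of the two relevant estimator arguments follows; the helper name `checkpoint_kwargs` and the bucket and prefix layout are hypothetical.

```python
def checkpoint_kwargs(bucket, job_prefix):
    """Return estimator arguments that sync local checkpoints to S3."""
    return {
        # S3 location that SageMaker keeps in sync with the local path below.
        "checkpoint_s3_uri": f"s3://{bucket}/{job_prefix}/checkpoints",
        # Must match default_root_dir used by the Lightning Trainer.
        "checkpoint_local_path": "/opt/ml/checkpoints",
    }
```

These would be passed when constructing the estimator, for example `PyTorch(..., **checkpoint_kwargs("my-bucket", "lightning-run"))`, where the bucket and prefix are placeholders.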

Also, the link here explains the coding process well: https://vision.unipv.it/events/Bianchi_giu2021-Introduction-PyTorch-Lightning.pdf

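The easiest way to run PyTorch Lightning on SageMaker is to start from the SageMaker PyTorch estimator (example). Ideally, add a requirements.txt next to your source code so that pytorch-lightning is installed in the training container.

As for distributed training, Amazon SageMaker recently launched native support for running PyTorch Lightning distributed training. Follow the links below to set up your training code: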

https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-modify-sdp-pt-lightning.html

https://aws.amazon.com/blogs/machine-learning/run-pytorch-lightning-and-native-pytorch-ddp-on-amazon-sagemaker-training-featuring-amazon-search/
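On the launcher side, both approaches described in these links are switched on through the estimator's `distribution` argument. The sketch below only builds that dictionary; the helper name `distribution_config` and its `backend` labels are made up for illustration, and supported instance types and framework versions should be checked against the linked documentation.

```python
def distribution_config(backend="smddp"):
    """Return a value for the SageMaker estimator's `distribution` argument."""
    if backend == "smddp":
        # SageMaker distributed data parallel library.
        return {"smdistributed": {"dataparallel": {"enabled": True}}}
    if backend == "pytorchddp":
        # Native PyTorch DistributedDataParallel launched by SageMaker.
        return {"pytorchddp": {"enabled": True}}
    raise ValueError(f"unknown backend: {backend!r}")
```

For example, `PyTorch(..., instance_count=2, distribution=distribution_config("pytorchddp"))`. The training script itself still needs the changes described in the linked pages.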
