Problem training YoloV5 on AWS SageMaker | AlgorithmError: , exit code: 1



I am trying to train YoloV5 on AWS SageMaker with custom data (stored in S3) via a Docker image (ECR), and I keep getting "AlgorithmError: , exit code: 1". Can someone tell me how to debug this problem?

Here is the Dockerfile:

# GET THE AWS IMAGE
FROM 763104351884.dkr.ecr.eu-west-3.amazonaws.com/pytorch-training:1.11.0-gpu-py38-cu113-ubuntu20.04-sagemaker
# UPDATES
RUN apt update
RUN DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt install -y tzdata
RUN apt install -y python3-pip git zip curl htop screen libgl1-mesa-glx libglib2.0-0
RUN alias python=python3
# INSTALL REQUIREMENTS
COPY requirements.txt .
RUN python3 -m pip install --upgrade pip
RUN pip install --no-cache -r requirements.txt albumentations gsutil notebook \
    coremltools onnx onnx-simplifier onnxruntime openvino-dev tensorflow-cpu tensorflowjs


COPY code /opt/ml/code
WORKDIR /opt/ml/code

RUN git clone https://github.com/ultralytics/yolov5 /opt/ml/code/yolov5
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code
ENV SAGEMAKER_PROGRAM trainYolo.py

ENTRYPOINT ["python", "trainYolo.py"]

Here is trainYolo.py:


import json 
import os
import numpy as np
import cv2 as cv
import subprocess
import yaml
import shutil

trainSet = os.environ["SM_CHANNEL_TRAIN"]
valSet = os.environ["SM_CHANNEL_VAL"]
output_dir = os.environ["SM_CHANNEL_OUTPUT"]
#Creating the data.yaml for yolo
dict_file = [{'names' : ['block']},
{'nc' : ['1']}, {'train': [trainSet]}
, {'val': [valSet]}]
with open(r'data.yaml', 'w') as file:
    documents = yaml.dump(dict_file, file)


#Execute this command to train Yolo
res = subprocess.run(["python3", "yolov5/train.py",  "--batch", "16" "--epochs", "100", "--data", "data.yaml", "--cfg", "yolov5/models/yolov5s.yaml","--weights", "yolov5s.pt"  "--cache"], shell=True)

shutil.copy("yolov5", output_dir)

Note: I am not sure whether subprocess.run() works in an environment like SageMaker.

Thanks
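On the subprocess.run() question: it is ordinary Python and behaves in a container as it does anywhere else, but the call in trainYolo.py has two pitfalls worth checking. First, `"16" "--epochs"` and `"yolov5s.pt"  "--cache"` have no comma between them, and Python joins adjacent string literals into one argument. Second, according to the subprocess documentation, passing a list together with `shell=True` on POSIX makes only the first item the command, with the rest going to the shell itself. The following is a minimal sketch of both points; the child command is a stand-in that simply echoes its arguments, not the real train.py.

```python
import subprocess
import sys

# Adjacent string literals are joined by Python, so a missing comma
# silently merges two command-line arguments into one.
args = ["--batch", "16" "--epochs", "100"]
print(args)  # ['--batch', '16--epochs', '100']

# With a list and the default shell=False, every item reaches the child
# process as exactly one argument. The child here only prints its argv.
cmd = [sys.executable, "-c", "import sys; print(sys.argv[1:])", "--batch", "16"]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
print(result.stdout.strip())  # ['--batch', '16']
```

Using `check=True` also makes a failing child raise `CalledProcessError`, so the error is not swallowed.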

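So your training script is not configured properly. When using a SageMaker estimator or Script Mode, you must configure it in a format that saves the model properly. Here is an example notebook that uses TensorFlow and Script Mode. If you want to build your own Dockerfile (Bring Your Own Container), then you have to configure your train file as shown in the second link.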

Script Mode: https://github.com/RamVegiraju/SageMaker-Deployment/tree/master/RealTime/Script-Mode/TensorFlow/Classification

BYOC: https://github.com/RamVegiraju/SageMaker-Deployment/tree/master/RealTime/BYOC/Sklearn/Sklearn-Regressor/container/randomForest
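As a rough illustration of what "configured to save the model" could look like for the trainYolo.py in the question, here is a hedged sketch, not a verified fix. It rests on several assumptions: the channels are named `train` and `val`; the `SM_*` variables may be absent when the image's own ENTRYPOINT bypasses the SageMaker training toolkit, so the standard `/opt/ml/...` container paths are used as fallbacks; train.py writes its results under `yolov5/runs/train`; and a failure message written to `/opt/ml/output/failure` is what SageMaker reports as the failure reason. The function names are made up for this example, and the YAML is written by hand only to keep the sketch dependency-free.

```python
import os
import shutil
import subprocess
import sys
import traceback
from pathlib import Path


def write_data_yaml(train_dir, val_dir, path="data.yaml"):
    """Write the YOLOv5 dataset file as ONE mapping with scalar values."""
    text = (
        f"train: {train_dir}\n"
        f"val: {val_dir}\n"
        "nc: 1\n"
        "names: ['block']\n"
    )
    Path(path).write_text(text)
    return path


def build_train_command(data_yaml, epochs=100, batch=16):
    """Build the argument list for train.py; every token is its own item."""
    return [
        sys.executable, "yolov5/train.py",
        "--batch", str(batch),
        "--epochs", str(epochs),
        "--data", str(data_yaml),
        "--cfg", "yolov5/models/yolov5s.yaml",
        "--weights", "yolov5s.pt",
        "--cache",
    ]


def export_artifacts(src_dir, model_dir):
    """Copy a whole results directory into the model directory."""
    dst = Path(model_dir) / Path(src_dir).name
    # shutil.copy handles single files only; copytree copies directories.
    shutil.copytree(src_dir, dst, dirs_exist_ok=True)
    return dst


def main():
    # Assumption: fall back to the standard container layout when the
    # SM_* variables are not set.
    train_dir = os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
    val_dir = os.environ.get("SM_CHANNEL_VAL", "/opt/ml/input/data/val")
    model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
    try:
        data_yaml = write_data_yaml(train_dir, val_dir)
        subprocess.run(build_train_command(data_yaml), check=True)
        # Assumption: train.py saves its runs under yolov5/runs/train.
        export_artifacts("yolov5/runs/train", model_dir)
    except Exception:
        # Assumption: SageMaker reads the failure reason from this file.
        failure = Path("/opt/ml/output/failure")
        failure.parent.mkdir(parents=True, exist_ok=True)
        failure.write_text(traceback.format_exc())
        raise
```

In trainYolo.py the script would end with a call to `main()` under the usual `if __name__ == "__main__":` guard. Compared with the original, this reads no `SM_CHANNEL_OUTPUT` variable (which only exists if a channel named `output` is passed), writes `nc` as an integer, and leaves the full traceback in the job logs because the exception is re-raised.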
