SageMaker training job fails: problem executing the user script



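I am new to AWS SageMaker and am trying to deploy my SKLearn script to an endpoint so that I can call it from an Android app. I am following the code here, and so far I have managed to get each block working with my script. The block giving me trouble is: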

sklearn_estimator.latest_training_job.wait(logs="None")
artifact = sm_boto3.describe_training_job(
    TrainingJobName=sklearn_estimator.latest_training_job.name
)["ModelArtifacts"]["S3ModelArtifacts"]
print("Model artifact persisted at " + artifact)

Specifically, the first line. When I run this block, I get this error:

UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-54-65920860bce1> in <module>
----> 1 sklearn_estimator.latest_training_job.wait(logs="None")
2 artifact = sm_boto3.describe_training_job(
3     TrainingJobName=sklearn_estimator.latest_training_job.name
4 )["ModelArtifacts"]["S3ModelArtifacts"]
5 
/opt/conda/lib/python3.7/site-packages/sagemaker/estimator.py in wait(self, logs)
1994             self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
1995         else:
-> 1996             self.sagemaker_session.wait_for_job(self.job_name)
1997 
1998     def describe(self):
/opt/conda/lib/python3.7/site-packages/sagemaker/session.py in wait_for_job(self, job, poll)
3217             lambda last_desc: _train_done(self.sagemaker_client, job, last_desc), None, poll
3218         )
-> 3219         self._check_job_status(job, desc, "TrainingJobStatus")
3220         return desc
3221 
/opt/conda/lib/python3.7/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
3381                 message=message,
3382                 allowed_statuses=["Completed", "Stopped"],
-> 3383                 actual_status=status,
3384             )
3385 
UnexpectedStatusException: Error for Training job rf-scikit-2022-08-05-22-32-08-239: Failed. Reason: AlgorithmError: framework error: 
Traceback (most recent call last):
File "/miniconda3/lib/python3.7/site-packages/sagemaker_containers/_trainer.py", line 84, in train
entrypoint()
File "/miniconda3/lib/python3.7/site-packages/sagemaker_sklearn_container/training.py", line 39, in main
train(environment.Environment())
File "/miniconda3/lib/python3.7/site-packages/sagemaker_sklearn_container/training.py", line 35, in train
runner_type=runner.ProcessRunnerType)
File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/entry_point.py", line 100, in run
wait, capture_error
File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/process.py", line 291, in run
cwd=environment.code_dir,
File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/process.py", line 208, in check_error
info=extra_info,
sagemaker_training.errors.ExecuteUserScriptError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage ""
Command "/miniconda3/bin/python SageMaker_Script.py"
ExecuteUse

SageMaker_Script.py is the name of my script. The relevant code in my script is:

if __name__ == '__main__':
    print('extracting arguments')
    parser = argparse.ArgumentParser()
    # hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])

    # Data, model, and output directories
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
    parser.add_argument("--train-file", type=str, default="jumpstrain.csv")
    parser.add_argument("--test-file", type=str, default="jumpstest.csv")
    args, _ = parser.parse_known_args()
    print('reading data')
    train_df = pd.read_csv(os.path.join(args.train, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test, args.test_file))
    print('building training and testing datasets')
    X_train = train_df[columns]
    X_test = test_df[columns]
    y_train = train_df[['Under-rotated']]
    y_test = test_df[['Under-rotated']]
    print('training model')
    model = RandomForestClassifier(n_estimators = 100)
    model.fit(X_train, y_train)
    print('validating model')
    pred_values = model.predict(X_test[columns])
    print('f1-score:')
    f1score = f1_score(y_test, pred_values)
    print(f1score)
    # persist model
    path = os.path.join(args.model_dir, 'model.joblib')
    joblib.dump(model, path)
    print('model persisted at ' + path)
    print(args.min_samples_leaf)

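I don't know what the problem is because, as I said, I am new to AWS in general, and the error message it gives me is not very informative. Any help would be appreciated.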

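There appears to be an error in your training script. As a starting point, I suggest following this step-by-step guide to creating a training script and job for a scikit-learn model: https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html#train-a-model-with-scikit-learn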
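Two things in the posted script stand out as possible causes, though without the job's logs this is only a guess: `--model-dir` is registered twice, which argparse rejects, and `args.min_samples_leaf` is read at the end without ever being defined. Below is a minimal sketch of an argument parser that avoids both. It assumes `min_samples_leaf` was meant to be a hyperparameter; the `--min-samples-leaf` option, its default of 1, and the helper names `build_parser` and `duplicate_option_fails` are hypothetical and not taken from the original script.

```python
import argparse
import os


def build_parser():
    """Build the script's argument parser (hypothetical helper name)."""
    parser = argparse.ArgumentParser()
    # Register each option exactly once: a second add_argument call with the
    # same option string raises argparse.ArgumentError.
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
    parser.add_argument("--train-file", type=str, default="jumpstrain.csv")
    parser.add_argument("--test-file", type=str, default="jumpstest.csv")
    # Assumption: min_samples_leaf was intended as a hyperparameter, so it
    # needs its own option before args.min_samples_leaf can be read.
    parser.add_argument("--min-samples-leaf", type=int, default=1)
    return parser


def duplicate_option_fails():
    """Return True if registering --model-dir twice raises ArgumentError."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-dir", type=str)
    try:
        parser.add_argument("--model-dir", type=str)
    except argparse.ArgumentError:
        return True
    return False
```

In the script, `args, _ = build_parser().parse_known_args()` would then replace the existing parser setup.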

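Because the job reports an empty `ErrorMessage`, the actual Python traceback has to come from somewhere else, typically the training job's CloudWatch logs. Another option is to run the entry point locally with the `SM_*` environment variables the script reads pointed at local folders, so the traceback appears directly. The following is a rough sketch, assuming the script and the CSV files are available on the local machine; `run_entry_point_locally` and its parameters are hypothetical names and not part of the SageMaker SDK.

```python
import os
import subprocess
import sys


def run_entry_point_locally(script_path, model_dir, train_dir, test_dir):
    """Run a training script in a subprocess with SageMaker-style env vars.

    Returns the subprocess.CompletedProcess so the caller can inspect
    returncode, stdout, and stderr.
    """
    env = dict(os.environ)
    env.update(
        {
            "SM_MODEL_DIR": model_dir,
            "SM_CHANNEL_TRAIN": train_dir,
            "SM_CHANNEL_TEST": test_dir,
        }
    )
    return subprocess.run(
        [sys.executable, script_path],
        env=env,
        capture_output=True,
        text=True,
    )
```

Calling it with `SageMaker_Script.py` and three local directories, then printing `stderr` from the returned object, should show the same exception that made the container exit with code 1, provided the local environment has the same packages installed.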