Batch transform job fails with "InternalServerError" for data files >100MB

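I am using SageMaker to perform binary classification on time series, where each sample is a numpy array of shape [24,11] (24h, 11 features). I used a TensorFlow model in script mode, and my script is very similar to the one I used as a reference: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_script_mode_training_and_serving/mnist.py

Training completed successfully, and I was able to deploy the model for batch transform. The transform job works fine when I input just a few samples (e.g., [10,24,11]), but it returns an InternalServerError when I input more samples for prediction (e.g., [30000, 24, 11], which is >100MB in size).

Here is the error: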

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-0c46f7563389> in <module>()
32 
33 # Then wait until transform job is completed
---> 34 tf_transformer.wait()
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/transformer.py in wait(self)
133     def wait(self):
134         self._ensure_last_transform_job()
--> 135         self.latest_transform_job.wait()
136 
137     def _ensure_last_transform_job(self):
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/transformer.py in wait(self)
207 
208     def wait(self):
--> 209         self.sagemaker_session.wait_for_transform_job(self.job_name)
210 
211     @staticmethod
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in wait_for_transform_job(self, job, poll)
893         """
894         desc = _wait_until(lambda: _transform_job_status(self.sagemaker_client, job), poll)
--> 895         self._check_job_status(job, desc, 'TransformJobStatus')
896         return desc
897 
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
915             reason = desc.get('FailureReason', '(No reason provided)')
916             job_type = status_key_name.replace('JobStatus', ' job')
--> 917             raise ValueError('Error for {} {}: {} Reason: {}'.format(job_type, job, status, reason))
918 
919     def wait_for_endpoint(self, endpoint, poll=5):
ValueError: Error for Transform job Tensorflow-batch-transform-2019-05-29-02-56-00-477: Failed Reason: InternalServerError: We encountered an internal error.  Please try again.

When deploying the model I tried both the SingleRecord and MultiRecord strategies, but the result was the same, so I decided to keep MultiRecord. My transformer looks like this:

transformer = tf_estimator.transformer(
    instance_count=1,
    instance_type='ml.m4.xlarge',
    max_payload=100,
    assemble_with='Line',
    strategy='MultiRecord'
)

At first I used a json file as input for the transform job, and it threw the error:

Too much data for max payload size

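So next I tried the jsonlines format (the .npy format is not supported, as far as I understand), thinking that jsonlines could be split by Line and thus avoid the size error, but that is where I got the InternalServerError. Here is the relevant code: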

# Convert test_x to jsonlines and save
test_x_list = test_x.tolist()
file_path = 'data_cnn_test/test_x.jsonl'
file_name = 'test_x.jsonl'
with jsonlines.open(file_path, 'w') as writer:
    writer.write(test_x_list)

input_key = 'batch_transform_tf/input/{}'.format(file_name)
output_key = 'batch_transform_tf/output'
test_input_location = 's3://{}/{}'.format(bucket, input_key)
test_output_location = 's3://{}/{}'.format(bucket, output_key)
s3.upload_file(file_path, bucket, input_key)

# Initialize the transformer object
tf_transformer = sagemaker.transformer.Transformer(
    base_transform_job_name='Tensorflow-batch-transform',
    model_name='sagemaker-tensorflow-scriptmode-2019-05-29-02-46-36-162',
    instance_count=1,
    instance_type='ml.c4.2xlarge',
    output_path=test_output_location,
    assemble_with='Line'
)

# Start the transform job
tf_transformer.transform(test_input_location, content_type='application/jsonlines', split_type='Line')

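The list named test_x_list has shape [30000, 24, 11], which corresponds to 30000 samples, so I would like to get 30000 predictions back.

I suspect that my jsonlines file is not being split by Line and is of course too big to be processed in one batch, which throws the error, but I don't understand why it doesn't get split correctly. I am using the default output_fn and input_fn (I did not override those functions in my script).

Any insight into what I could be doing wrong would be greatly appreciated.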

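I assume this is a duplicate of this AWS Forum post: https://forums.aws.amazon.com/thread.jspa?threadID=303810&tstart=0

Anyway, for the sake of completeness I'll answer here as well.

The issue is that you are serializing your dataset incorrectly when converting it to jsonlines: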

test_x_list = test_x.tolist()
...
with jsonlines.open(file_path, 'w') as writer:
    writer.write(test_x_list)

What the above does is create one very large single line containing your full dataset, which is too big for a single inference call to consume.
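One way to confirm this diagnosis is to count the lines in the file and measure the longest one. The sketch below is only an illustration: it uses the standard-library `json` module rather than the `jsonlines` package, a tiny nested list stands in for `test_x.tolist()`, and the helper name `line_stats` and the file name are made up for the example.

```python
import json
import os
import tempfile


def line_stats(path):
    """Return (number of lines, size in bytes of the longest line)."""
    count, longest = 0, 0
    with open(path, "rb") as f:
        for line in f:
            count += 1
            longest = max(longest, len(line))
    return count, longest


# Stand-in for test_x.tolist(): 5 samples of shape [24, 11] (assumed, kept small).
samples = [[[0.0] * 11 for _ in range(24)] for _ in range(5)]

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "whole_dataset.jsonl")

# Same mistake as above: the whole dataset is written as one JSON value.
with open(path, "w") as f:
    f.write(json.dumps(samples) + "\n")

n_lines, longest = line_stats(path)
print(n_lines, longest)  # one line whose size is the size of the whole file
```

If the line count is 1 for a file that is supposed to hold thousands of samples, splitting by Line has nothing to split on.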

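I suggest you change your code to make each line a single sample, so that inference can take place on individual samples instead of the whole dataset: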

test_x_list = test_x.tolist()
...
with jsonlines.open(file_path, 'w') as writer:
    for sample in test_x_list:
        writer.write(sample)

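If one sample at a time is too slow, you can also play around with the max_concurrent_transforms, strategy and max_payload parameters to batch the data as well as run concurrent transforms (if your algorithm can run in parallel). And of course you can also split the data into multiple files and run the transform with multiple nodes. See https://sagemaker.readthedocs.io/en/latest/transformer.html and https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTransformJob.html for more details on what these parameters do.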
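To get a rough feel for these options, the sketch below estimates how many one-sample lines would fit under a given payload limit and splits a dataset into several files. It rests on assumptions: the limit is treated as a value in MB with 1 MB taken as 1024 * 1024 bytes, every sample is assumed to serialize to about the same size, and `records_per_payload` and `write_shards` are hypothetical helper names, not part of the SageMaker SDK. How the service actually packs records into a request may differ.

```python
import json
import os
import tempfile


def records_per_payload(sample, max_payload_mb):
    """Estimate how many serialized samples fit in one request body."""
    line_bytes = len(json.dumps(sample).encode("utf-8")) + 1  # +1 for "\n"
    return (max_payload_mb * 1024 * 1024) // line_bytes


def write_shards(samples, out_dir, num_shards):
    """Write samples as contiguous chunks, one JSON Lines file per chunk."""
    shard_size = -(-len(samples) // num_shards)  # ceiling division
    paths = []
    for i in range(num_shards):
        chunk = samples[i * shard_size:(i + 1) * shard_size]
        if not chunk:
            break  # fewer samples than requested shards
        path = os.path.join(out_dir, "part-{:03d}.jsonl".format(i))
        with open(path, "w") as f:
            for sample in chunk:
                f.write(json.dumps(sample) + "\n")  # one sample per line
        paths.append(path)
    return paths


# Stand-in data: 10 samples of shape [24, 11] (assumed, kept small).
samples = [[[0.0] * 11 for _ in range(24)] for _ in range(10)]
out_dir = tempfile.mkdtemp()

per_request = records_per_payload(samples[0], max_payload_mb=6)
paths = write_shards(samples, out_dir, num_shards=3)
print(per_request, [os.path.basename(p) for p in paths])
```

Contiguous chunks are used here so that the order of samples within each file matches the original order, which makes it easier to line predictions back up with inputs.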
