ResumableUploadAbortException: Upload complete with 1141101995 additional bytes left in stream

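I am using the GCP Vertex AI platform for distributed training. The model is trained in parallel on 4 GPUs with PyTorch and HuggingFace. After training, when I copy the model from the local container to a GCP bucket, it throws an error.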

Here is the code.

I launch train.py like this:

python -m torch.distributed.launch --nproc_per_node 4  train.py

After training completes, I save the model files with the following. There are 3 files to save.

trainer.save_model("model_mlm") #Saves in local directory
subprocess.call('gsutil -o GSUtil:parallel_composite_upload_threshold=0  cp -r /pythonPackage/trainer/model_mlm gs://*****/model_mlm', shell=True, stdout=subprocess.PIPE) #from local to GCP

Error:

ResumableUploadAbortException: Upload complete with 1141101995 additional bytes left in stream; this can happen if a file changes size while being uploaded

Sometimes I get this error instead:

ResumableUploadAbortException: 409 The object has already been created in an earlier attempt and was overwritten, possibly due to a race condition.

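According to the documentation, this is a name conflict: you are trying to overwrite a file that has already been created.

I therefore suggest changing the destination location for each training run by adding a unique identifier, so that you no longer get this type of error. For example, append a timestamp in string format to the end of the bucket path, such as: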

- gs://pypl_bkt_prd_row_std_aiml_vertexai/model_mlm_vocab_exp2_50epocs/20220407150000
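A minimal sketch of building such a timestamped destination in Python. The helper name `timestamped_destination` and the bucket `gs://my-bucket` are hypothetical, chosen only for illustration; the local path is the one from the question.

```python
from datetime import datetime, timezone
from typing import List, Optional


def timestamped_destination(base_uri: str, now: Optional[datetime] = None) -> str:
    """Append a YYYYMMDDHHMMSS timestamp to a gs:// prefix (hypothetical helper)."""
    # Use the current UTC time unless a fixed time is passed in (handy for tests).
    now = now or datetime.now(timezone.utc)
    return f"{base_uri.rstrip('/')}/{now.strftime('%Y%m%d%H%M%S')}"


def build_upload_command(local_dir: str, destination: str) -> List[str]:
    """Build the gsutil argument list; passing a list avoids shell=True quoting issues."""
    return ["gsutil", "cp", "-r", local_dir, destination]


# Example (bucket name is an assumption, not from the original post):
destination = timestamped_destination("gs://my-bucket/model_mlm")
command = build_upload_command("/pythonPackage/trainer/model_mlm", destination)
# To actually upload: subprocess.run(command, check=True)
```

Because every run writes to a fresh prefix, a later run cannot collide with objects created by an earlier one.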

I would also like to mention that this kind of error is retryable, as noted in the error documentation.

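I ran into this problem too. It occurs when a file's contents change while rsync is uploading it. This can happen with large files, because file writes are not guaranteed to be transactional.

I solved it simply by retrying the gsutil rsync command.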
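The retry can also be automated. Below is a rough sketch, assuming a non-zero exit code from gsutil is a good enough signal to retry; `run_with_retries` is a hypothetical helper, and the bucket name in the usage comment is made up.

```python
import subprocess
import time


def run_with_retries(command, attempts=3, base_delay=2.0,
                     runner=subprocess.run, sleep=time.sleep):
    """Run a command, retrying with exponential backoff on a non-zero exit code.

    Returns the number of the attempt that succeeded; raises RuntimeError
    if every attempt fails. `runner` and `sleep` are injectable for testing.
    """
    for attempt in range(1, attempts + 1):
        result = runner(command)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Wait 2s, 4s, 8s, ... before the next try (with the default base_delay).
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"command failed after {attempts} attempts: {command}")


# Usage (bucket name is an assumption, not from the original post):
# run_with_retries(["gsutil", "rsync", "-r",
#                   "/pythonPackage/trainer/model_mlm",
#                   "gs://my-bucket/model_mlm"])
```

Retrying only helps once the local files have stopped changing, so the upload should start after the save has fully finished.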
