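Out of memory when submitting a TensorFlow 2 job on Google AI Platform

I am trying to submit a TensorFlow 2 training job (fine-tuning an object detection model) to Google AI Platform with gcloud. My dataset is not large (the raccoon dataset, about 10M). I have tried many configurations, but every time I get the same error: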

The replica master 0 ran out-of-memory and exited with a non-zero status of 9(SIGKILL)

My command:

gcloud ai-platform jobs submit training OD_ssd_fpn_large \
  --job-dir=gs://${MODEL_DIR} \
  --package-path ./object_detection \
  --module-name object_detection.model_main_tf2 \
  --region us-east1 \
  --config cloud.yml \
  -- \
  --model_dir=gs://${MODEL_DIR} \
  --pipeline_config_path=gs://${PIPELINE_CONFIG_PATH}

My latest attempt at the cloud.yml file uses large_model machines:

trainingInput:
  runtimeVersion: "2.2"
  pythonVersion: "3.7"
  scaleTier: CUSTOM
  masterType: large_model
  workerCount: 5
  workerType: large_model
  parameterServerCount: 3
  parameterServerType: large_model

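But the error is always the same. Any hint or help is much appreciated.

Reading all the data consumes RAM, which is why the job runs out of memory. You need a larger instance type (large_model or complex_model_l; see the AI Platform machine types documentation for more details).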

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
  workerType: n1-highcpu-16
  parameterServerType: n1-highmem-8
  evaluatorType: n1-highcpu-16
  workerCount: 9
  parameterServerCount: 3
  evaluatorCount: 1

Or you need to reduce the size of your dataset.
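The error message names "replica master 0" as the process that was killed, so one thing that may be worth trying is giving the master specifically more memory instead of adding more workers. The fragment below is only a sketch of that idea, not a verified fix: it reuses the Compute Engine machine types from the configuration above and swaps the master to n1-highmem-16, which is my assumption (on Compute Engine, n1-highmem-16 has 104 GB of RAM, while n1-highcpu-16 has 14.4 GB). The runtimeVersion and pythonVersion values are copied from the question.

```yaml
# Hypothetical cloud.yml variant: same layout as the configuration above,
# but with a high-memory master, because the master replica is the one
# that was reported as running out of memory.
trainingInput:
  runtimeVersion: "2.2"
  pythonVersion: "3.7"
  scaleTier: CUSTOM
  masterType: n1-highmem-16        # assumption: more RAM for the master only
  workerType: n1-highcpu-16
  parameterServerType: n1-highmem-8
  evaluatorType: n1-highcpu-16
  workerCount: 9
  parameterServerCount: 3
  evaluatorCount: 1
```

It would be passed with the same --config flag as in the question's command. If the master still gets killed with this much memory, the machine size is probably not the real limit and the input pipeline is the more likely place to look.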
