Memory leak in TensorFlow Google Cloud ML training

I have been trying the TensorFlow tutorial scripts on Google Cloud ML. In particular, I used the CIFAR10 CNN tutorial scripts at https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10.

When I run this training script in Google Cloud ML, memory leaks at a rate of about 0.5% per hour.
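A leak rate like this can be measured by sampling the process's memory at intervals during training. The following is a minimal sketch, not part of the tutorial scripts: the helper names `peak_rss_kb` and `growth_percent_per_hour` are hypothetical, and it assumes the 0.5% figure is a share of the machine's total memory and that the code runs on Linux, where `ru_maxrss` is reported in kilobytes.

```python
import resource


def peak_rss_kb():
    """Return this process's peak resident set size.

    On Linux, ru_maxrss is reported in kilobytes. Peak RSS never
    decreases, which is enough to track steadily growing usage.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def growth_percent_per_hour(samples, total_kb):
    """Estimate memory growth as a percentage of total memory per hour.

    samples:  list of (elapsed_seconds, rss_kb) tuples, oldest first.
    total_kb: total memory of the machine in kilobytes (assumed known).
    """
    (t0, m0), (t1, m1) = samples[0], samples[-1]
    hours = (t1 - t0) / 3600.0
    return 100.0 * (m1 - m0) / total_kb / hours
```

In a training loop, one could append `(time.time() - start, peak_rss_kb())` to a list every few hundred steps and log the result of `growth_percent_per_hour` alongside the loss.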

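I have not made any changes to the scripts other than packaging them into the required GCP format (as described in https://cloud.google.com/ml-engine/docs/how-tos/packaging-trainer) and setting the data location to the storage bucket containing the .bin data files.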

If I run locally, i.e. not in Google Cloud, and use TCMALLOC by setting LD_PRELOAD="/usr/lib/libtcmalloc.so", the memory leak is resolved. However, I do not have this option with Google Cloud ML.
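For a local run, the LD_PRELOAD setting can also be applied from a small launcher. This is only a sketch under assumptions: the helper names `tcmalloc_env` and `run_training_locally` are hypothetical, the script filename is inferred from the module name in the gcloud command, and the library path is the one quoted above, which may differ on other systems.

```python
import os
import subprocess
import sys

# Library path as quoted in the question; it may differ on other systems.
TCMALLOC_PATH = "/usr/lib/libtcmalloc.so"


def tcmalloc_env(base_env=None, lib_path=TCMALLOC_PATH):
    """Return a copy of the environment with LD_PRELOAD pointing at tcmalloc."""
    env = dict(os.environ if base_env is None else base_env)
    env["LD_PRELOAD"] = lib_path
    return env


def run_training_locally(script="cifar10_multi_gpu_train.py", num_gpus=4):
    """Launch the training script in a child process that preloads tcmalloc.

    The script name is an assumption; adjust it to the local checkout.
    """
    cmd = [sys.executable, script, "--num_gpus={}".format(num_gpus)]
    return subprocess.call(cmd, env=tcmalloc_env())
```

The environment is copied rather than modified in place, so the launcher process itself keeps its default allocator and only the child training process is affected.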

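What could be causing the leak, and what can I do to fix it? Why aren't other users noticing the same problem? Although the leak is small, it is big enough to make my training sessions run out of memory and fail when I train on my own data for several days. The leak happens regardless of the number of GPUs I use.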
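To see why a 0.5% per hour leak only shows up on multi-day jobs, a rough estimate helps. The sketch below assumes the leak is linear and expressed as a share of total memory; the 40% starting usage is a made-up figure for illustration, not a measurement from this job, and `hours_until_oom` is a hypothetical helper.

```python
def hours_until_oom(start_percent, leak_percent_per_hour):
    """Hours until memory usage reaches 100%, assuming linear growth.

    start_percent:         memory in use when training starts (percent).
    leak_percent_per_hour: growth rate in percentage points per hour.
    """
    return (100.0 - start_percent) / leak_percent_per_hour


# Hypothetical example: a job starting at 40% usage and leaking 0.5% per hour.
hours = hours_until_oom(40.0, 0.5)
days = hours / 24.0
```

Under these assumptions the job would last 120 hours, i.e. 5 days, which is consistent with short tutorial runs finishing without trouble while multi-day runs fail.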

The gcloud command I used is:

gcloud ml-engine jobs submit training cifar10_job --job-dir gs://tfoutput/joboutput --package-path trainer --module-name=trainer.cifar10_multi_gpu_train --region europe-west1 --staging-bucket gs://tfoutput --scale-tier CUSTOM --config config.yml --runtime-version 1.0 -- --num_gpus=4

The configuration file (config.yml) is:

trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m_gpu

Any help is appreciated, thanks.

We recommend using this version of the code:

github.com/tensorflow/models/pull/1538

It has performance benefits (and by running for less time, you are less prone to OOMs).

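Of course, that may not be a permanent fix. However, according to our tests, TensorFlow 1.2 appears to fix the issue. TensorFlow 1.2 will be available soon on CloudML Engine. If you continue to have problems, please let us know.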
