当填充shuffle缓冲区时,Google协作会话突然结束



我正在使用Google Colaboratory来训练一种图像识别算法,使用TensorFlow 1.15。我已经将所有需要的文件上传到Google Drive,并获得了运行代码,直到shuffle缓冲区完成运行。然而,我得到了一个"^C";在对话框中,无法弄清楚发生了什么。

注意:我之前曾尝试在电脑上训练算法,但没有删除上一次训练中生成的检查点文件。这可能是问题所在吗?

代码:

!pip install --upgrade pip
!pip install --upgrade protobuf
!pip install tensorflow-gpu==1.15
import tensorflow as tf
print(tf.__version__)
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
raise SystemError('GPU device not found')
print('Found GPU at {}'.format(device_name))
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize
import psutil
import humanize
import os
import GPUtil as GPU
GPUs = GPU.getGPUs()
gpu = GPUs[0]
def printm():
process = psutil.Process(os.getpid())
print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available ), " | Proc size: " + humanize.naturalsize( process.memory_info().rss))
print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB".format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))
printm()
from google.colab import drive
#Mount the drive
drive.mount('/content/gdrive')
#Change to working tensorflow directory on the drive
%cd '/content/gdrive/My Drive/weeds/tensorflow_models/models/research/object_detection/'
!apt-get install protobuf-compiler python-pil python-lxml python-tk
!pip install Cython
%cd /content/gdrive/My Drive/weeds/tensorflow_models/models/research/
!protoc object_detection/protos/*.proto --python_out=.
import os
os.environ['PYTHONPATH'] += ':/content/gdrive/My Drive/weeds/tensorflow_models/models/research/:/content/gdrive/My Drive/weeds/tensorflow_models/models/research/slim'
!python setup.py build
!python setup.py install
import time, psutil
Start = time.time() - psutil.boot_time()
Left = 12*3600 - Start
print('Time remaining for this session is: ', Left/3600)
!pip install tf_slim
%cd /content/gdrive/My Drive/weeds/tensorflow_models/models/research/object_detection/
os.environ['PYTHONPATH'] += ':/content/gdrive/My Drive/weeds/tensorflow_models/models/research/:/content/gdrive/My Drive/weeds/tensorflow_models/models/research/slim'
!python train.py --train_dir=training/ --pipeline_config_path=training/ssd_mobilenet_v1_coco.config --logtostderr

该过程在此终止,但它需要开始用";全球步骤">

2020-10-18 22:42:45.587477: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 168 of 2048
2020-10-18 22:42:55.668973: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 334 of 2048
2020-10-18 22:43:06.067869: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 379 of 2048
2020-10-18 22:43:15.705090: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 503 of 2048
2020-10-18 22:43:26.781151: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 576 of 2048
2020-10-18 22:43:38.120069: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 640 of 2048
2020-10-18 22:43:45.813089: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 708 of 2048
2020-10-18 22:43:58.071040: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 752 of 2048
2020-10-18 22:44:07.506961: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 828 of 2048
2020-10-18 22:44:16.355753: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 908 of 2048
2020-10-18 22:44:25.922348: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 960 of 2048
INFO:tensorflow:global_step/sec: 0
I1018 22:44:34.783342 140291121678080 supervisor.py:1099] global_step/sec: 0
2020-10-18 22:44:36.327813: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1036 of 2048
2020-10-18 22:44:45.651473: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1151 of 2048
2020-10-18 22:44:55.554234: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1186 of 2048
2020-10-18 22:45:05.648568: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1242 of 2048
2020-10-18 22:45:15.644396: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1313 of 2048
2020-10-18 22:45:25.551708: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1386 of 2048
2020-10-18 22:45:35.549003: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1458 of 2048
2020-10-18 22:45:45.648835: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1531 of 2048
2020-10-18 22:45:55.643920: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1602 of 2048
2020-10-18 22:46:05.559702: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1674 of 2048
2020-10-18 22:46:15.547609: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1746 of 2048
2020-10-18 22:46:25.645939: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1819 of 2048
INFO:tensorflow:global_step/sec: 0
I1018 22:46:35.052108 140291121678080 supervisor.py:1099] global_step/sec: 0
2020-10-18 22:46:35.645583: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1891 of 2048
2020-10-18 22:46:45.553851: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1962 of 2048
^C

我能做些什么来解决这个问题?训练过程在我的PC(NVIDA GEFORCE RTX(上运行得很好,但我只需要通过Google Colab获得更多的计算能力。

我不能运行你的代码,因为你在其中使用了一些文件。但我可以告诉你,这可能是因为你使用的是带有GPU的TF 1,在Colab中,当涉及到GPU时,降级并不容易。

例如,我没有在你的代码中看到你已经将CUDA降级(到你想要的版本(如下:

!wget https://developer.nvidia.com/compute/cuda/9.0/Prod/local_installers/cuda-repo-ubuntu1604-9-0-local_9.0.176-1_amd64-deb
!dpkg -i cuda-repo-ubuntu1604-9-0-local_9.0.176-1_amd64-deb
!apt-key add /var/cuda-repo-9-0-local/7fa2af80.pub
!apt-get update
!apt-get install cuda=9.0.176-1

您可以通过!nvcc --version检查CUDA的版本。

AND Colab降级TensorFlow版本的速度并不快。您可能需要多次重新启动运行时。

我建议您将代码更改为TensorFlow 2

最新更新