I'm training a TensorFlow CycleGAN model on a Kaggle TPU. Everything looks fine when training starts, but training freezes randomly after a few epochs. According to Kaggle's monitoring, RAM does not blow up during training.
I get warnings like this while training:
2022-11-28 07:22:58.323282: W ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:57] Ignoring an error encountered when deleting remote tensors handles: Invalid argument: Unable to find the relevant tensor remote_handle: Op ID: 89987, Output num: 0
Additional GRPC error information from remote target /job:worker/replica:0/task:0:
:{"created":"@1669620178.323159560","description":"Error received from peer ipv4:10.0.0.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 89987, Output num: 0","grpc_status":3}
Epoch 5/200
When I configure the TPU, the warnings are:
2022-11-28 13:56:35.038036: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-11-28 13:56:35.040789: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/conda/lib
2022-11-28 13:56:35.040821: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2022-11-28 13:56:35.040850: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (06e37d3ac4e4): /proc/driver/nvidia/version does not exist
2022-11-28 13:56:35.043518: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-28 13:56:35.044759: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-11-28 13:56:35.079672: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.0.0.2:8470}
2022-11-28 13:56:35.079743: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30020}
2022-11-28 13:56:35.098707: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.0.0.2:8470}
2022-11-28 13:56:35.098760: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30020}
2022-11-28 13:56:35.101231: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:411] Started server with target: grpc://localhost:30020
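For context, my TPU setup is the standard Kaggle boilerplate; a minimal sketch of the initialization that produces the GrpcChannelCache / "Started server" messages above (assuming the usual TPUClusterResolver flow, not my exact cell) is:

import tensorflow as tf

# Standard Kaggle TPU initialization for TF 2.4; falls back to the default
# strategy if no TPU is attached to the notebook session.
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # auto-detects the Kaggle TPU
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.TPUStrategy(tpu)
except ValueError:
    tpu = None
    strategy = tf.distribute.get_strategy()

print("Number of replicas:", strategy.num_replicas_in_sync)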
The TensorFlow version is 2.4.1, and I haven't touched any other configuration. My model.fit call looks like this:
history = gan_model.fit(gan_ds,
                        epochs=EPOCHS,
                        callbacks=[GANMonitor()],
                        steps_per_epoch=(max(n_monet_samples, n_photo_samples) // BATCH_SIZE),
                        verbose=2,
                        workers=0).history
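GANMonitor here is a small Keras callback (in the style of the Keras/Kaggle CycleGAN examples) that just writes a few translated sample images at the end of each epoch; a rough sketch, with hypothetical names such as monet_generator and photo_ds standing in for the tutorial's generator and photo dataset, looks like:

# Rough sketch of a GANMonitor-style callback; monet_generator and photo_ds are
# hypothetical placeholders, and images are assumed to be scaled to [-1, 1].
class GANMonitor(tf.keras.callbacks.Callback):
    def __init__(self, num_img=4):
        super().__init__()
        self.num_img = num_img

    def on_epoch_end(self, epoch, logs=None):
        for i, img in enumerate(photo_ds.take(self.num_img)):
            prediction = monet_generator(img, training=False)[0].numpy()
            prediction = (prediction * 127.5 + 127.5).astype("uint8")
            tf.keras.preprocessing.image.save_img(f"generated_{epoch:03d}_{i}.png", prediction)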
Most of the code comes from a Kaggle tutorial, but I changed the model architecture. Is there a way to fix this? 🙏
I have already tried running with verbose=1 and saw that training freezes at a random step in the middle of an epoch. The number of epochs I can get through seems to depend on the model architecture and the batch size, so I suspect a memory problem?
I tried running the following two tutorials on a v3-8 and hit similar warnings in both runs:
- https://www.kaggle.com/code/philculliton/a-simple-petals-tf-2-2-notebook
- https://www.kaggle.com/code/amyjang/monet-cyclegan-tutorial
But the warnings did not interrupt their training.
Can you check whether the original tutorial code runs for a large number of epochs? If it does, you may want to review your changes to the model architecture.
Also, if batch_size affects how many epochs training gets through, it is most likely an out-of-memory error. Try reducing batch_size so that the per-core batch is a factor of 128, and see whether the run completes.
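For example, you can sanity-check candidate global batch sizes against the 8 cores of a v3-8 before a run; a quick sketch (assuming strategy.num_replicas_in_sync from your TPU setup):

# Sketch: check that the per-core batch divides 128 evenly, then step the
# global batch size down until training completes without freezing.
REPLICAS = strategy.num_replicas_in_sync  # 8 on a v3-8

def per_core_batch_ok(global_batch, replicas=REPLICAS):
    per_core = global_batch // replicas
    return global_batch % replicas == 0 and per_core > 0 and 128 % per_core == 0

for candidate in (1024, 512, 256, 128, 64):
    print(candidate, per_core_batch_ok(candidate))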
More resources -
- How an inappropriate batch_size can cause OOM - https://cloud.google.com/tpu/docs/performance-guide#xla-efficiencies
- Profiling guide - https://cloud.google.com/tpu/docs/cloud-tpu-tools
Feel free to explore our in-depth TPU guides and excellent tutorials - https://cloud.google.com/tpu/docs/intro-to-tpu