How can I train tensorflow.keras models in parallel on GPUs? TensorFlow 2.5.0



I have the following code, which runs a custom model that lives in a separate module and takes several parameters (learning rate, convolution kernel sizes, etc.) as input.

custom_model is a function that builds and compiles a tensorflow.keras.models.Model in TensorFlow and returns the model.
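custom_model itself is not shown here; purely to make the snippets below self-contained, a minimal stand-in with the same call signature might look like the following (the architecture and input shapes are guesses, not the real model):

import tensorflow as tf

def custom_model(winx, winz, lrate=0.001, usebias=True, kz1=9, kz2=5):
    # Hypothetical placeholder: only the signature matches the question,
    # the layers and shapes are made up for illustration.
    inp = tf.keras.Input(shape=(winx, winx, winz))
    x = tf.keras.layers.Conv2D(16, kz1, padding='same', use_bias=usebias,
                               activation='relu')(inp)
    out = tf.keras.layers.Conv2D(winz, kz2, padding='same', use_bias=usebias)(x)
    model = tf.keras.models.Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lrate),
                  loss='mse')
    return model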

  • LOW is the training dataset

  • HIGH is the target dataset

I load both datasets from an HDF5 file, but they are fairly large, around 10 GB.

Normally I run this in JupyterLab without any problems, and the model does not use up all of the GPU's resources. At the end I save the weights for the different parameter combinations.

Now my question is:

How can I turn this into a script and run it in parallel for different values of k1 and k2? I think something like a bash loop would do, but I want to avoid re-reading the dataset for every run. I am using Windows 10 as the operating system.

import tensorflow as tf

# Allow memory growth so the process does not grab all GPU memory up front
physical_devices = tf.config.list_physical_devices('GPU')
for gpu_instance in physical_devices:
    tf.config.experimental.set_memory_growth(gpu_instance, True)

import h5py
from model_custom import custom_model

winx = 100
winz = 10
k1 = 9
k2 = 5

# Load the training (LOW) and target (HIGH) datasets from the HDF5 file
with h5py.File('MYFILE', 'r') as hf:
    LOW = hf['LOW'][:]
    HIGH = hf['HIGH'][:]

# Build, train and save the model on the second GPU
with tf.device("/gpu:1"):
    mymodel = custom_model(winx, winz, lrate=0.001, usebias=True, kz1=k1, kz2=k2)
    myhistory = mymodel.fit(LOW, HIGH, batch_size=1, epochs=1)
    mymodel.save_weights('zkernel_{}_kz1_{}_kz2_{}.hdf5'.format(winz, k1, k2))
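
For comparison: if parallelism were not needed, the re-reading issue alone could be avoided by loading the file once and looping over the kernel sizes in a single process. This is only a sequential baseline sketch (the parameter grid is made up), not an answer to the parallel part of the question:

import itertools
import h5py
from model_custom import custom_model

winx = 100
winz = 10

# Load the ~10 GB file once, then sweep the kernel sizes sequentially
with h5py.File('MYFILE', 'r') as hf:
    LOW = hf['LOW'][:]
    HIGH = hf['HIGH'][:]

for k1, k2 in itertools.product([9, 8, 7], [5, 4, 3]):
    mymodel = custom_model(winx, winz, lrate=0.001, usebias=True, kz1=k1, kz2=k2)
    mymodel.fit(LOW, HIGH, batch_size=1, epochs=1)
    mymodel.save_weights('zkernel_{}_kz1_{}_kz2_{}.hdf5'.format(winz, k1, k2))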

I found that the following solution works well for me. It lets me run model training in parallel on the GPUs using MPI with mpi4py. The only remaining problem is that when I load the large file and start several processes at once, the data gets loaded once per process, which exceeds my RAM capacity (see the shared-memory sketch at the end of this answer for one possible workaround).

from mpi4py import MPI
import tensorflow as tf

# Allow memory growth so each process does not grab all GPU memory up front
physical_devices = tf.config.list_physical_devices('GPU')
for gpu_instance in physical_devices:
    tf.config.experimental.set_memory_growth(gpu_instance, True)

import h5py
from model_custom import custom_model

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

winx = 100
winy = 100
winz = 10

# Rank 10 reads the datasets from disk; all other ranks receive them via broadcast
if rank == 10:
    with h5py.File('mifile.hdf5', 'r') as hf:
        LOW = hf['LOW'][:]
        HIGH = hf['HIGH'][:]
else:
    HIGH = None
    LOW = None
HIGH = comm.bcast(HIGH, root=10)
LOW = comm.bcast(LOW, root=10)

# Ranks 0-4 train on the second GPU with kz1=9, ranks 5-9 on the third GPU with
# kz1=8; each rank picks its own kz2
if rank < 5:
    with tf.device("/gpu:1"):
        k = 9
        q = rank + 1
        mymodel1 = custom_model(winx, winz, lrate=0.001, usebias=True, kz1=k, kz2=q)
        mymodel1._name = '{}_{}_{}'.format(winz, k, q)
        myhistory1 = mymodel1.fit(LOW, HIGH, batch_size=1, epochs=1)
        mymodel1.save_weights(mymodel1.name + 'winz_{}_k_{}_q_{}.hdf5'.format(winz, k, q))
elif 5 <= rank < 10:
    with tf.device("/gpu:2"):
        k = 8
        q = rank + 1 - 5
        mymodel2 = custom_model(winx, winz, lrate=0.001, usebias=True, kz1=k, kz2=q)
        mymodel2._name = '{}_{}_{}'.format(winz, k, q)
        myhistory2 = mymodel2.fit(LOW, HIGH, batch_size=1, epochs=1)
        mymodel2.save_weights(mymodel2.name + 'winz_{}_k_{}_q_{}.hdf5'.format(winz, k, q))

I save this as a Python module named mycode.py and then run it from the console with

mpiexec -n 11 python ./mycode.py
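
With 11 ranks, ranks 0-9 each train one model and rank 10 only reads the HDF5 file and broadcasts it. The remaining RAM problem comes from comm.bcast: every rank ends up holding its own full copy of LOW and HIGH. One possible way around this, as long as all ranks run on the same machine, is MPI shared memory (MPI.Win.Allocate_shared), so that a single copy is mapped into every process. This is only an untested sketch that assumes float32 data; the file name mifile.hdf5 is taken from the code above and the loader is rank 0 here instead of rank 10:

from mpi4py import MPI
import numpy as np
import h5py

comm = MPI.COMM_WORLD

def load_shared(name, comm, root=0):
    # Load dataset `name` on `root` only, then expose one shared copy to all
    # ranks on the same node via an MPI shared-memory window.
    rank = comm.Get_rank()
    if rank == root:
        with h5py.File('mifile.hdf5', 'r') as hf:
            data = np.asarray(hf[name], dtype=np.float32)
        shape = data.shape
    else:
        data, shape = None, None
    shape = comm.bcast(shape, root=root)
    itemsize = np.dtype(np.float32).itemsize
    nbytes = int(np.prod(shape)) * itemsize if rank == root else 0
    win = MPI.Win.Allocate_shared(nbytes, itemsize, comm=comm)
    buf, _ = win.Shared_query(root)
    shared = np.ndarray(buffer=buf, dtype=np.float32, shape=shape)
    if rank == root:
        shared[...] = data   # only the root copies the data into the window
    comm.Barrier()           # everyone waits until the copy is finished
    return shared

LOW = load_shared('LOW', comm)
HIGH = load_shared('HIGH', comm)

Each rank can then pass LOW and HIGH to model.fit exactly as before, without memory use growing with the number of processes; the training ranks should treat the shared arrays as read-only. Whether this works with your particular MPI installation on Windows 10 would need to be verified.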
