PyTorch DataLoader: bad file descriptor and EOF for workers > 0



Problem description

I am running into strange behavior while training a neural network with a PyTorch DataLoader built on top of a custom dataset. The DataLoader is configured with num_workers=4 and pin_memory=False.

Most of the time, training completes without issue. Occasionally, it stops at a random moment with one of the following errors:

  1. OSError: [Errno 9] Bad file descriptor
  2. EOFError

The error seems to occur while a socket is being created to fetch elements from the DataLoader. It disappears when I set the number of workers to 0, but I need multiprocessing to speed up training. What could be the source of the error? Thanks a lot.

Python and library versions

Python 3.9.12, PyTorch 1.11.0+cu102
Edit: the error only occurs on the cluster.

Output of the error file

Traceback (most recent call last):
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/resource_sharer.py", line 145, in _serve
Epoch 17:  52%|█████▏    | 253/486 [01:00<00:55,  4.18it/s, loss=1.73]
Traceback (most recent call last):
  File "/my_directory/bench/run_experiments.py", line 251, in <module>
    send(conn, destination_pid)
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/resource_sharer.py", line 50, in send
    reduction.send_handle(conn, new_fd, pid)
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/reduction.py", line 183, in send_handle
    with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/socket.py", line 545, in fromfd
    return socket(family, type, proto, nfd)
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/socket.py", line 232, in __init__
    _socket.socket.__init__(self, family, type, proto, fileno)
OSError: [Errno 9] Bad file descriptor
    main(args)
  File "/my_directory/bench/run_experiments.py", line 183, in main
    run_experiments(args, save_path)
  File "/my_directory/bench/run_experiments.py", line 70, in run_experiments
    ) = run_algorithm(algorithm_params[j], mp[j], ss, dataset)
  File "/my_directorybench/algorithms.py", line 38, in run_algorithm
    data = es(mp, search_space, dataset, **ps)
  File "/my_directorybench/algorithms.py", line 151, in es
    data = ss.generate_random_dataset(mp,
  File "/my_directorybench/architectures.py", line 241, in generate_random_dataset
    arch_dict = self.query_arch(
  File "/my_directory/bench/architectures.py", line 71, in query_arch
    train_losses, val_losses, model = meta_net.get_val_loss(
  File "/my_directory/bench/meta_neural_net.py", line 50, in get_val_loss
    return self.training(
  File "/my_directorybench/meta_neural_net.py", line 155, in training
    train_loss = self.train_step(model, device, train_loader, epoch)
  File "/my_directory/bench/meta_neural_net.py", line 179, in train_step
    for batch_idx, mini_batch in enumerate(pbar):
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/my_directory/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/my_directory/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1207, in _next_data
    idx, data = self._get_data()
  File "/my_directory/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1173, in _get_data
    success, data = self._try_get_data()
  File "/my_directory/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1011, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/my_directory/.local/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 295, in rebuild_storage_fd
    fd = df.detach()
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/reduction.py", line 189, in recv_handle
    return recvfds(s, 1)[0]
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/reduction.py", line 159, in recvfds
    raise EOFError
EOFError

Edit: how the data is accessed

from PIL import Image
from torch.utils.data import DataLoader

# extract of code of dataset

class Dataset():
    def __init__(self, image_files, mask_files):
        self.image_files = image_files
        self.mask_files = mask_files

    def __getitem__(self, idx):
        img = Image.open(self.image_files[idx]).convert('RGB')
        mask = Image.open(self.mask_files[idx]).convert('L')
        return img, mask

    def __len__(self):
        # required by DataLoader, e.g. for shuffle=True
        return len(self.image_files)

# extract of code of trainloader

train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=4,
    num_workers=4,
    pin_memory=False,
    shuffle=True,
    drop_last=True,
    persistent_workers=False,
)

I finally found a solution. Adding this configuration to the dataset script works:

import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')

By default, the sharing strategy is set to 'file_descriptor'.
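For reference, the active strategy can be inspected and switched at runtime with the documented torch.multiprocessing API; a minimal sketch:

```python
import torch.multiprocessing as mp

# 'file_descriptor' is the default on Linux. 'file_system' shares tensors
# through named files in shared memory instead of passing file descriptors
# over Unix sockets, so it does not consume one fd per shared tensor.
print(mp.get_all_sharing_strategies())  # strategies supported on this platform
mp.set_sharing_strategy('file_system')
print(mp.get_sharing_strategy())
```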

I had previously tried some solutions, explained in:

  • this question (increasing shared memory, increasing the maximum number of open file descriptors, calling torch.cuda.empty_cache() at the end of every epoch, …)
  • and this other question, which finally solved the problem
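For the "maximum number of open file descriptors" point above, the limit can be raised from inside the training script with the standard-library resource module; a minimal sketch (Unix only, the 4096 fallback is an arbitrary illustrative value):

```python
import resource

# Inspect the per-process limits on open file descriptors
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft}, hard={hard}")

# Raise the soft limit toward the hard limit; no root privileges are
# needed as long as the new soft limit stays at or below the hard limit.
new_soft = hard if hard != resource.RLIM_INFINITY else 4096
resource.setrlimit(resource.RLIMIT_NOFILE, (max(soft, new_soft), hard))
```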

As @AlexMeredith suggested, the error may be related to the distributed file system (Lustre) that some clusters use. It could also come from distributed shared memory.

In this example, only the dataset implementation is shown; there is no snippet showing what happens to the batches.

In my case, I was storing batches in an index-like array object, which is fortunately already described here. Because of that, the DataLoader could not close its subprocesses. Implementing something like the following helped me solve the issue:

import copy

index = []  # index-like storage that outlives the loop
for batch in data_loader:
    # deep-copy the batch so no reference to the worker's shared-memory
    # storage is kept, then drop the original batch
    batch_cp = copy.deepcopy(batch)
    del batch
    index.append(batch_cp["index"])

I also got other errors related to this, such as:

  • received 0 items of ancdata
  • bad message length
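Both of those errors also tend to point at file-descriptor exhaustion while workers pass tensors over Unix sockets. On Linux, one way to watch how close the process is to the soft limit is to count the entries in /proc/self/fd; a small diagnostic sketch (the open_fd_count helper is hypothetical, not from the original post):

```python
import os
import resource

def open_fd_count() -> int:
    """Number of file descriptors currently open in this process (Linux-only)."""
    return len(os.listdir('/proc/self/fd'))

soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
# Errors such as "received 0 items of ancdata" tend to show up as this
# count approaches the soft limit.
print(f"{open_fd_count()} fds open, soft limit {soft}")
```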
