Error at epoch 14 while training Resnet50 on Imagenet



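I am training Resnet50 on imagenet using the script provided by PyTorch (with a small tweak for my purposes). However, after 14 epochs of training, I get the following error. I have allocated 4 GPUs on the server I use to run this. Any pointers as to what this error is about would be appreciated. Thanks a lot!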

Epoch: [14][5000/5005]  Time 1.910 (2.018)  Data 0.000 (0.191)  Loss 2.6954 (2.7783)    Total 2.6954 (2.7783)   Reg 0.0000  Prec@1 42.969 (40.556)  Prec@5 64.844 (65.368)   
Test: [0/196]   Time 86.722 (86.722)    Loss 1.9551 (1.9551)    Prec@1 51.562 (51.562)  Prec@5 81.641 (81.641)
Traceback (most recent call last):
File "main_group.py", line 549, in <module>
File "main_group.py", line 256, in main

File "main_group.py", line 466, in validate
if args.gpu is not None:
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 801, in __next__
return self._process_data(data)
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 846, in _process_data
data.reraise()
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/_utils.py", line 385, in reraise
raise self.exc_type(msg)
OSError: Caught OSError in DataLoader worker process 11.
Original Traceback (most recent call last):
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 138, in __getitem__
sample = self.loader(path)
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 174, in default_loader
return pil_loader(path)
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 155, in pil_loader
with open(path, 'rb') as f:
OSError: [Errno 5] Input/output error: '/data/users2/oiler/github/imagenet-data/val/n02102973/ILSVRC2012_val_00009130.JPEG'

It is hard to tell where the problem lies just from the error you posted.

All we know is that something went wrong while reading the file at '/data/users2/oiler/github/imagenet-data/val/n02102973/ILSVRC2012_val_00009130.JPEG'.

Try the following:

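  1. Confirm that the file actually exists
  2. Confirm that it is a valid JPEG and is not corrupted (by viewing it)
  3. Confirm that you can open it with Python, and also load it manually with PIL
  4. If none of that works, try deleting the file. Do you then get the same error on another file in the folder?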
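
The first three checks can be scripted. This is only a minimal sketch, assuming Pillow is installed (torchvision already depends on it). The names `check_image` and `scan_folder` are hypothetical helpers made up for illustration; they are not part of PyTorch or torchvision.

```python
import os
from PIL import Image


def check_image(path):
    """Return 'ok', 'missing', 'io_error' or 'corrupt' for one image file."""
    try:
        # Read every byte so that an OS-level failure (such as Errno 5)
        # shows up here instead of inside a DataLoader worker.
        with open(path, "rb") as f:
            f.read()
    except FileNotFoundError:
        return "missing"
    except OSError:
        return "io_error"
    try:
        # convert("RGB") forces a full decode of the image data.
        with Image.open(path) as img:
            img.convert("RGB")
    except Exception:
        return "corrupt"
    return "ok"


def scan_folder(root):
    """Walk root and return {path: status} for every file that is not 'ok'."""
    bad = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            status = check_image(path)
            if status != "ok":
                bad[path] = status
    return bad
```

Calling `check_image` on the path from the traceback covers steps 1 to 3 for that one file, and `scan_folder` on the `val` directory would show whether other files are affected. Note that in your traceback the exception is raised by `open(path, 'rb')` itself, before PIL decodes anything, so an `io_error` result would point at the disk or filesystem rather than at the JPEG contents.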
