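I am looking for an optimized way to load multiple huge .npy files with a PyTorch data loader. My current approach creates a new dataloader for every file in every epoch.
My data loader looks like this: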
import pickle

import torch


class GetData(torch.utils.data.Dataset):
    def __init__(self, data_path, target_path, transform=None):
        # Both files are read fully into memory as float tensors.
        with open(data_path, 'rb') as train_pkl_file:
            data = pickle.load(train_pkl_file)
            self.data = torch.from_numpy(data).float()
        with open(target_path, 'rb') as target_pkl_file:
            targets = pickle.load(target_pkl_file)
            self.targets = torch.from_numpy(targets).float()

    def __getitem__(self, index):
        x = self.data[index]
        y = self.targets[index]
        return index, x, y

    def __len__(self):
        num_images = self.data.shape[0]
        return num_images
I have two lists of npy files:
list1 = ['d1.npy', 'd2.npy', 'd3.npy']
list2 = ['s1.npy', 's2.npy', 's3.npy']
I created a dataset that yields the file names:
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, flist1, flist2):
        self.npy_list1 = flist1
        self.npy_list2 = flist2

    def __getitem__(self, idx):
        # Return the pair of file names, not the loaded arrays.
        filename1 = self.npy_list1[idx]
        filename2 = self.npy_list2[idx]
        return filename1, filename2

    def __len__(self):
        return len(self.npy_list1)
I iterate over them as follows:
for epoch in range(500):
    print('Epoch #%s' % epoch)
    model.train()
    loss_, elbo_, recon_ = [[] for _ in range(3)]
    running_loss = 0

    # FOR EVERY SMALL FILE
    print("Training: ")

    # TRAIN HERE
    my_dataset = MyDataset(list1, list2)
    for idx, (dynamic_file, static_file) in tqdm(enumerate(my_dataset)):
        ...  # Do stuff
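The approach above works, but I am looking for a more memory-efficient solution. Note: I have a huge amount of data (> 200 GB), so concatenating the numpy arrays into a single file may not be an option because of RAM limitations. Thanks in advance.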
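According to the numpy.load documentation, you can set the argument mmap_mode='r' to receive a memory-mapped array (numpy.memmap).
A memory-mapped array is kept on disk. However, it can be accessed and sliced like any ndarray. Memory mapping is especially useful for accessing small fragments of large files without reading the entire file into memory.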
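As a minimal sketch of what that looks like in practice (the temporary file name and the 5x4 array are made up for illustration), the object returned by np.load is a numpy.memmap, and copying out a single row does not require loading the rest of the file:

```python
import os
import tempfile

import numpy as np

# Hypothetical demo file; the name and shape are illustrative only.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.npy")
np.save(path, np.arange(20, dtype=np.float32).reshape(5, 4))

# mmap_mode='r' maps the file read-only instead of reading it into RAM.
arr = np.load(path, mmap_mode="r")
print(type(arr).__name__)
print(arr.shape, arr.dtype)

# Indexing gives a view backed by the file; np.array(...) copies just that row.
row = np.array(arr[2])
print(row)
```

Because the map is opened with mode 'r', the array is read-only; copy a slice first if you need to modify it.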
I tried to implement a dataset that uses memory maps. First, I generated some data as follows:
import os

import numpy as np

feature_size = 16
total_count = 0
os.makedirs('data', exist_ok=True)  # np.save does not create the directory
for index in range(10):
    # File k holds 1000 * (k + 1) rows, so the files differ in length.
    count = 1000 * (index + 1)
    D = np.random.rand(count, feature_size).astype(np.float32)
    S = np.random.rand(count, 1).astype(np.float32)
    np.save(f'data/d{index}.npy', D)
    np.save(f'data/s{index}.npy', S)
    total_count += count
print("Dataset size:", total_count)
print("Total bytes:", total_count * (feature_size + 1) * 4, "bytes")
The output is:
Dataset size: 55000
Total bytes: 3740000 bytes
Then, my implementation of the dataset looks like this:
import numpy as np
import torch
from bisect import bisect
import os, psutil  # used to monitor memory usage


class BigDataset(torch.utils.data.Dataset):
    def __init__(self, data_paths, target_paths):
        # Open every file as a read-only memory map; nothing is read into RAM yet.
        self.data_memmaps = [np.load(path, mmap_mode='r') for path in data_paths]
        self.target_memmaps = [np.load(path, mmap_mode='r') for path in target_paths]
        # start_indices[k] is the global index of the first row of file k.
        self.start_indices = [0] * len(data_paths)
        self.data_count = 0
        for index, memmap in enumerate(self.data_memmaps):
            self.start_indices[index] = self.data_count
            self.data_count += memmap.shape[0]

    def __len__(self):
        return self.data_count

    def __getitem__(self, index):
        # Find the file that contains this global index, then the row inside it.
        memmap_index = bisect(self.start_indices, index) - 1
        index_in_memmap = index - self.start_indices[memmap_index]
        data = self.data_memmaps[memmap_index][index_in_memmap]
        target = self.target_memmaps[memmap_index][index_in_memmap]
        return index, torch.from_numpy(data), torch.from_numpy(target)


# Test Code
if __name__ == "__main__":
    data_paths = [f'data/d{index}.npy' for index in range(10)]
    target_paths = [f'data/s{index}.npy' for index in range(10)]

    process = psutil.Process(os.getpid())
    memory_before = process.memory_info().rss

    dataset = BigDataset(data_paths, target_paths)

    used_memory = process.memory_info().rss - memory_before
    print("Used memory:", used_memory, "bytes")

    dataset_size = len(dataset)
    print("Dataset size:", dataset_size)
    print("Samples:")
    for sample_index in [0, dataset_size//2, dataset_size-1]:
        print(dataset[sample_index])
The output is as follows:
Used memory: 299008 bytes
Dataset size: 55000
Samples:
(0, tensor([0.5240, 0.2931, 0.9039, 0.9467, 0.8710, 0.2147, 0.4928, 0.8309, 0.7344, 0.2861, 0.1557, 0.7009, 0.1624, 0.8608, 0.5378, 0.4304]), tensor([0.7725]))
(27500, tensor([0.8109, 0.3794, 0.6377, 0.4825, 0.2959, 0.6325, 0.7278, 0.6856, 0.1037, 0.3443, 0.2469, 0.4317, 0.6690, 0.4543, 0.7007, 0.5733]), tensor([0.7856]))
(54999, tensor([0.4013, 0.9990, 0.9107, 0.9897, 0.0204, 0.2776, 0.5529, 0.5752, 0.2266, 0.9352, 0.2130, 0.9542, 0.4116, 0.4959, 0.1436, 0.9840]), tensor([0.6342]))
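To see where those three samples come from, the index lookup can be checked on its own. This is only a sketch of the bisect step, using the row counts of the ten generated files; the helper name locate is made up for illustration and is not part of the dataset class:

```python
from bisect import bisect

# Row counts of the ten generated files: 1000, 2000, ..., 10000.
counts = [1000 * (i + 1) for i in range(10)]

# start_indices[k] is the global index of the first row of file k.
start_indices = []
total = 0
for count in counts:
    start_indices.append(total)
    total += count


def locate(index):
    """Map a global sample index to (file number, row within that file)."""
    file_index = bisect(start_indices, index) - 1
    return file_index, index - start_indices[file_index]


for index in (0, 999, 1000, 27500, 54999):
    print(index, "->", locate(index))
```

For example, index 27500 falls in file 6 (which starts at global index 21000) at row 6500, and index 54999 is the last row of file 9.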
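According to the results, the memory usage is under 10% of the total size (299,008 of 3,740,000 bytes). I have not tried my code with very large files, so I don't know how efficient it would be for more than 200 GB of data. If you could try it out and tell me the memory usage with and without memmap, I would be grateful.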
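If it helps with trying it out, here is a rough sketch of plugging such a dataset into a regular DataLoader so that a single loader covers all files. The class name MemmapDataset, the tiny temporary files, and the batch size are assumptions for illustration. It also copies each row with np.array before converting it, which is my own variation so that the tensors own writable memory; I have not tested it with num_workers > 0 or at the 200 GB scale.

```python
import os
import tempfile
from bisect import bisect

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class MemmapDataset(Dataset):
    """Compact variant of the memory-mapped dataset, for illustration only."""

    def __init__(self, data_paths, target_paths):
        self.data_memmaps = [np.load(p, mmap_mode="r") for p in data_paths]
        self.target_memmaps = [np.load(p, mmap_mode="r") for p in target_paths]
        # Global index of the first row of each file.
        self.start_indices = []
        self.data_count = 0
        for memmap in self.data_memmaps:
            self.start_indices.append(self.data_count)
            self.data_count += memmap.shape[0]

    def __len__(self):
        return self.data_count

    def __getitem__(self, index):
        file_index = bisect(self.start_indices, index) - 1
        row = index - self.start_indices[file_index]
        # np.array(...) copies one row out of the read-only map.
        x = torch.from_numpy(np.array(self.data_memmaps[file_index][row]))
        y = torch.from_numpy(np.array(self.target_memmaps[file_index][row]))
        return index, x, y


# Hypothetical small files standing in for the real d*.npy / s*.npy data.
tmp_dir = tempfile.mkdtemp()
data_paths, target_paths = [], []
for i in range(3):
    count = 10 * (i + 1)
    data_paths.append(os.path.join(tmp_dir, f"d{i}.npy"))
    target_paths.append(os.path.join(tmp_dir, f"s{i}.npy"))
    np.save(data_paths[-1], np.random.rand(count, 16).astype(np.float32))
    np.save(target_paths[-1], np.random.rand(count, 1).astype(np.float32))

dataset = MemmapDataset(data_paths, target_paths)
loader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=0)

# Each batch mixes rows from all files; only the requested rows are read.
for indices, x, y in loader:
    print(indices.shape, x.shape, y.shape)
    break
```

With shuffle=True the loader draws rows across all files in random order, so the per-file loop from the question is no longer needed; whether random access is fast enough on your disk is something you would have to measure.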