Python: dumping a very large list

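I have two directories, each containing about 50,000 images, most of them 240x180 in size.

I want to pickle their pixel data as training, validation and test sets, but the result turns out to be very, very large, and eventually the computer either freezes or runs out of disk space.

When the computer froze, the pkl file being generated was already 28 GB.

I'm not sure whether it is supposed to be this large.

Am I doing something wrong? Or is there a more efficient way to do this?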

from PIL import Image
import pickle
import os
indir1 = 'Positive'
indir2 = 'Negative'
trainimage = []
trainpixels = []
trainlabels = []
validimage = []
validpixels = []
validlabels = []
testimage = []
testpixels = []
testlabels = []

i=0
for (root, dirs, filenames) in os.walk(indir1):
    print 'hello'
    for f in filenames:
        try:
            im = Image.open(os.path.join(root,f))
            if i<40000:
                trainpixels.append(im.tostring())
                trainlabels.append(0)
            elif i<45000:
                validpixels.append(im.tostring())
                validlabels.append(0)
            else:
                testpixels.append(im.tostring())
                testlabels.append(0)
            print str(i)+'\t'+str(f)
            i+=1
        except IOError:
            continue
i=0
for (root, dirs, filenames) in os.walk(indir2):
    print 'hello'
    for f in filenames:
        try:
            im = Image.open(os.path.join(root,f))
            if i<40000:
                trainpixels.append(im.tostring())
                trainlabels.append(1)
            elif i<45000:
                validpixels.append(im.tostring())
                validlabels.append(1)
            else:
                testpixels.append(im.tostring())
                testlabels.append(1)
            print str(i)+'\t'+str(f)
            i+=1
        except IOError:
            continue
trainimage.append(trainpixels)
trainimage.append(trainlabels)
validimage.append(validpixels)
validimage.append(validlabels)
testimage.append(testpixels)
testimage.append(testlabels)
output=open('data.pkl','wb')
pickle.dump(trainimage,output)
pickle.dump(validimage,output)
pickle.dump(testimage,output)

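The pickle file format isn't particularly efficient, especially for images. Even if your pixels were stored as 1 byte per pixel, you would have

50,000 × 240 × 180 = 2,160,000,000

bytes, so about 2 GB. Your pixels undoubtedly take more space than that; I'm not sure what the PIL tostring() method actually does on an image. It's entirely plausible that your resulting file could be in the tens of GB.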
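To make the estimate concrete, here is a small sketch of the same arithmetic. It assumes 8-bit channels and every image at exactly 240x180; the mode names "L" (grayscale) and "RGB" follow PIL's conventions. Since the question has two directories, the total is closer to 100,000 images, and if they are RGB the raw data alone is about 13 GB. As far as I know, Python 2's pickle also defaults to the text-based protocol 0, which escapes non-printable bytes and can inflate binary strings considerably, so passing pickle.HIGHEST_PROTOCOL would likely help, though it would not get below the raw size.

```python
# Back-of-the-envelope size of the raw pixel data, before any pickle overhead.
# Assumptions: 8-bit channels, every image exactly 240x180.
WIDTH, HEIGHT = 240, 180
BYTES_PER_PIXEL = {"L": 1, "RGB": 3}  # grayscale vs. three colour channels


def raw_size_bytes(num_images, mode):
    """Uncompressed size in bytes of num_images images of WIDTH x HEIGHT."""
    return num_images * WIDTH * HEIGHT * BYTES_PER_PIXEL[mode]


one_dir_gray = raw_size_bytes(50000, "L")      # the figure quoted above
both_dirs_rgb = raw_size_bytes(100000, "RGB")  # both directories, RGB

print(one_dir_gray)   # 2160000000
print(both_dirs_rgb)  # 12960000000
```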

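You may want to consider a storage method other than pickle. For example, what would be wrong with simply storing the files on disk in their native image format, and pickling a list of the file names?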
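A minimal sketch of that idea, written for Python 3 rather than the Python 2 of the question. The helper names (list_images, split, save_manifest) are made up for illustration; the labels and the split by position mirror the question's code. Only paths and labels are pickled, so the file stays at a few megabytes.

```python
import os
import pickle


def list_images(directory, label):
    """Return sorted (path, label) pairs for every file under directory."""
    pairs = []
    for root, _dirs, filenames in os.walk(directory):
        for name in filenames:
            pairs.append((os.path.join(root, name), label))
    return sorted(pairs)


def split(pairs, n_train, n_valid):
    """Split by position into train / validation / test, as in the question."""
    return (pairs[:n_train],
            pairs[n_train:n_train + n_valid],
            pairs[n_train + n_valid:])


def save_manifest(path, pos_dir, neg_dir, n_train, n_valid):
    """Pickle three lists of (path, label) pairs instead of the pixel data."""
    pos = split(list_images(pos_dir, 0), n_train, n_valid)
    neg = split(list_images(neg_dir, 1), n_train, n_valid)
    names = ("train", "valid", "test")
    manifest = {name: p + n for name, p, n in zip(names, pos, neg)}
    with open(path, "wb") as f:
        pickle.dump(manifest, f, protocol=pickle.HIGHEST_PROTOCOL)
    return manifest
```

With the question's layout this would be called as save_manifest('data.pkl', 'Positive', 'Negative', 40000, 5000). The pixels are then read on demand, for example with Image.open(path) for each batch, so they never all sit in memory at once.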

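I agree that you probably shouldn't be storing tons of pickled images to disk, unless you absolutely have to (for whatever reason). You should probably get a really big disk, some really good memory, and plenty of processing power.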

Anyway, if you load your image data into a numpy.array with scipy.ndimage.imread, you can then use the numpy internal format plus compression to store the images on disk.
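As a rough illustration of "numpy internal format plus compression" using plain numpy, here is a sketch with np.savez_compressed. The arrays are synthetic stand-ins rather than real images (as far as I know, scipy.ndimage.imread has been removed from recent SciPy releases, so a different reader would be needed there), and the file name is hypothetical.

```python
import os
import tempfile

import numpy as np

# Stand-in data: 10 synthetic 180x240 RGB images (rows x columns x channels).
# With real files, each array would come from an image reader instead.
# Random noise barely compresses; real photographs usually do better.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(10, 180, 240, 3), dtype=np.uint8)
labels = np.zeros(10, dtype=np.uint8)

path = os.path.join(tempfile.mkdtemp(), "train.npz")  # hypothetical file name
np.savez_compressed(path, pixels=pixels, labels=labels)

# Loading gives the arrays back with dtype and shape intact.
with np.load(path) as data:
    restored = data["pixels"]
    restored_labels = data["labels"]

print(restored.shape, restored.dtype)
```

Storing uint8 arrays this way keeps one byte per channel value, which avoids the overhead of pickling Python strings.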

There are packages like klepto that make this easy for you.

>>> from klepto.archives import dir_archive
>>> from scipy import ndimage
>>> demo = dir_archive('demo', {}, serialized=True, compression=9, cached=False)
>>> demo['image1'] = ndimage.imread('image1')
>>> demo['image2'] = ndimage.imread('image2')

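You now have a dictionary interface to compressed, pickled image files in numpy's internal representation, with one image per file in a directory named demo (you may need to add the fast=True flag, I don't remember). Pretty much all of the dictionary methods are available, so you can access the images as your analysis requires, then discard a pickled image with del demo['image1'] or similar.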
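To show what "a dictionary interface with one file per entry" means in practice, here is a toy stand-in built only on the standard library. It is not klepto's implementation or API; the DirStore class and its file naming are invented for illustration, and keys are assumed to be strings that are valid file names.

```python
import os
import pickle
import tempfile
import zlib


class DirStore:
    """Toy dictionary-like store: one zlib-compressed pickle file per key.

    Hypothetical illustration only; this is not how klepto is implemented.
    """

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        return os.path.join(self.directory, key + ".pkl.z")

    def __setitem__(self, key, value):
        blob = zlib.compress(pickle.dumps(value, pickle.HIGHEST_PROTOCOL), 9)
        with open(self._path(key), "wb") as f:
            f.write(blob)

    def __getitem__(self, key):
        try:
            with open(self._path(key), "rb") as f:
                return pickle.loads(zlib.decompress(f.read()))
        except FileNotFoundError:
            raise KeyError(key)

    def __delitem__(self, key):
        try:
            os.remove(self._path(key))
        except FileNotFoundError:
            raise KeyError(key)

    def __contains__(self, key):
        return os.path.exists(self._path(key))


store = DirStore(os.path.join(tempfile.mkdtemp(), "demo"))
store["image1"] = b"\x00" * (240 * 180 * 3)  # placeholder for real pixel data
pixels = store["image1"]  # read back from disk on demand
del store["image1"]       # remove the file once it is no longer needed
```

Nothing is held in memory between accesses, which is the same effect the cached=False setting is used for in the klepto example.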

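You can also use klepto to easily provide custom encodings, so that you have fairly cryptic storage of your data. You can even choose not to encode/pickle your data at all, and just have a dictionary interface to the files on disk, which is often handy in itself.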

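If you don't turn off caching, you may hit the limits of your computer's memory or disk size, unless you are careful about the order in which you dump and load the images to disk. In the example above I have caching to memory turned off, so it writes directly to disk. There are other options as well, such as using memory-mapped mode or writing to an HDF file. I typically use a scheme like the above for large array data to be processed on a single machine, and might pick a MySQL archive backend for smaller data to be accessed by several machines in parallel.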
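For the memory-mapped option, a sketch using plain numpy (not klepto) might look like the following. It assumes all images share one shape; the sizes and file name are made up for the demo, and the loop writes a placeholder value where a real script would decode image i.

```python
import os
import tempfile

import numpy as np

n_images, height, width, channels = 20, 180, 240, 3  # small demo sizes
path = os.path.join(tempfile.mkdtemp(), "train_pixels.npy")  # hypothetical name

# Create the .npy file at its full size up front. The data lives on disk and
# only the parts being touched need to be in memory.
out = np.lib.format.open_memmap(
    path, mode="w+", dtype=np.uint8,
    shape=(n_images, height, width, channels),
)
for i in range(n_images):
    # Placeholder for decoding image i; a real loop would read a file here.
    out[i] = i
out.flush()
del out  # drop the writer's mapping

# Re-open read-only without loading the whole array into memory.
data = np.load(path, mmap_mode="r")
print(data.shape, int(data[7, 0, 0, 0]))
```

Because images are written one at a time, peak memory stays near the size of a single image instead of the whole dataset.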

Get klepto here: https://github.com/uqfoundation
