Creating a big zip archive in memory with Python, containing many small files (dynamically)



The task is:

  1. Read multiple files one by one from S3 storage
  2. Add the files to big_archive.zip
  3. Store big_archive.zip in S3 storage

The problem:

When we append a new file to a zip archive, the zip library changes the current archive (updates the meta information), then adds the file content (bytes). Because the archive is big, we need to store it in S3 in chunks. But! Chunks that have already been stored cannot be rewritten. Because of that, we cannot update the meta information.
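For context, this constraint matches how S3 multipart uploads behave: each flushed chunk becomes one part, and a part can never be modified once uploaded. A minimal sketch of that side, assuming boto3 (the bucket and key names here are made up, and note that S3 requires every part except the last to be at least 5 MB):

import boto3

s3 = boto3.client('s3')
# start a multipart upload for the archive (hypothetical bucket/key)
mpu = s3.create_multipart_upload(Bucket='my-bucket', Key='big_archive.zip')
parts = []

def flush_chunk(data: bytes, part_number: int):
    # once uploaded, this part cannot be rewritten -- the upload as a
    # whole can only be completed or aborted
    resp = s3.upload_part(
        Bucket='my-bucket', Key='big_archive.zip',
        UploadId=mpu['UploadId'], PartNumber=part_number, Body=data,
    )
    parts.append({'PartNumber': part_number, 'ETag': resp['ETag']})

# ... flush_chunk() would be called wherever the code below writes a chunk ...

s3.complete_multipart_upload(
    Bucket='my-bucket', Key='big_archive.zip',
    UploadId=mpu['UploadId'], MultipartUpload={'Parts': parts},
)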

The following code illustrates the problem on the zip side:

from io import BytesIO
import zipfile, sys, gc

files = (
    'input/i_1.docx',  # one file size is about ~500KB
    'input/i_2.docx',
    ...
    'input/i_11.docx',
    'input/i_12.docx',
    'input/i_13.docx',
    'input/i_14.docx'
)

# this function allows getting the size of an in-memory object
# thanks to
# https://towardsdatascience.com/the-strange-size-of-python-objects-in-memory-ce87bdfbb97f
def _get_size(input_obj):
    memory_size = 0
    ids = set()
    objects = [input_obj]
    while objects:
        new = []
        for obj in objects:
            if id(obj) not in ids:
                ids.add(id(obj))
                memory_size += sys.getsizeof(obj)
                new.append(obj)
        objects = gc.get_referents(*new)
    return memory_size

# open in-memory object
with BytesIO() as zip_obj_in_memory:
    # open zip archive on disk
    with open('tmp.zip', 'wb') as resulted_file:
        # set chunk size to 1MB
        chunk_max_size = 1048576  # 1MB
        # iterate over files
        for f in files:
            # get size of in-memory object
            current_size = _get_size(zip_obj_in_memory)
            # if the in-memory object is bigger than 1MB,
            # we need to flush it to S3 storage
            if current_size > chunk_max_size:
                # write the chunk out (it does not matter whether the storage is S3 or disk)
                resulted_file.write(zip_obj_in_memory.getvalue())
                # drop the current in-memory data
                zip_obj_in_memory.seek(0)
                # zip_obj_in_memory is empty after truncate, so we can add new files
                zip_obj_in_memory.truncate()
            # main step: open zip_obj_in_memory in append mode and append the next file
            with zipfile.ZipFile(zip_obj_in_memory, 'a', compression=zipfile.ZIP_DEFLATED) as zf:
                # read the file and write it into the archive
                with open(f, 'rb') as o:
                    zf.writestr(
                        zinfo_or_arcname=f.replace('input/', 'output/'),
                        data=o.read()
                    )
        # write the last chunk of data
        resulted_file.write(zip_obj_in_memory.getvalue())

Now let's try to list the files in the archive:

unzip -l tmp.zip
Archive:  tmp.zip
warning [tmp.zip]:  6987483 extra bytes at beginning or within zipfile
(attempting to process anyway)
Length      Date    Time    Name
---------  ---------- -----   ----
583340  12-15-2021 18:43   output/i_13.docx
583335  12-15-2021 18:43   output/i_14.docx
---------                     -------
1166675                     2 files

As we can see, only the last ~1MB chunk shows up.
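One quick way to see why unzip behaves like this: every flushed chunk that went through ZipFile's close() ends with its own end-of-central-directory (EOCD) record, so the concatenated file contains several of them, and unzip only honours the last one. Counting the EOCD signature bytes PK\x05\x06 makes that visible (a diagnostic sketch; collisions inside compressed data are possible but unlikely):

with open('tmp.zip', 'rb') as fh:
    data = fh.read()
# a well-formed archive contains exactly one EOCD record;
# here we get roughly one per flushed chunk
print(data.count(b'PK\x05\x06'))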

Let's fix this archive:

zip -FF tmp.zip --out fixed.zip
Fix archive (-FF) - salvage what can
Found end record (EOCDR) - says expect single disk archive
Scanning for entries...
copying: output/i_1.docx  (582169 bytes)
copying: output/i_2.docx  (582152 bytes)
Central Directory found...
EOCDR found ( 1 1164533)...
copying: output/i_3.docx  (582175 bytes)
Entry after central directory found ( 1 1164555)...
copying: output/i_4.docx  (582175 bytes)
Central Directory found...
EOCDR found ( 1 2329117)...
copying: output/i_5.docx  (582176 bytes)
Entry after central directory found ( 1 2329139)...
copying: output/i_6.docx  (582180 bytes)
Central Directory found...
EOCDR found ( 1 3493707)...
copying: output/i_7.docx  (582170 bytes)
Entry after central directory found ( 1 3493729)...
copying: output/i_8.docx  (582174 bytes)
Central Directory found...
...

After that:

unzip -l fixed.zip
Archive:  fixed.zip
Length      Date    Time    Name
---------  ---------- -----   ----
583344  12-15-2021 18:43   output/i_1.docx
583337  12-15-2021 18:43   output/i_2.docx
583346  12-15-2021 18:43   output/i_3.docx
583352  12-15-2021 18:43   output/i_4.docx
583361  12-15-2021 18:43   output/i_5.docx
583368  12-15-2021 18:43   output/i_6.docx
583356  12-15-2021 18:43   output/i_7.docx
583362  12-15-2021 18:43   output/i_8.docx
583337  12-15-2021 18:43   output/i_9.docx
583352  12-15-2021 18:43   output/i_10.docx
583363  12-15-2021 18:43   output/i_11.docx
583368  12-15-2021 18:43   output/i_12.docx
583340  12-15-2021 18:43   output/i_13.docx
583335  12-15-2021 18:43   output/i_14.docx
---------                     -------
8166921                     14 files

The files also extract fine, and their contents are correct.

According to Wikipedia:

The required meta information is stored in the central directory (CD).

So we need to strip the central directory information on every file append (before the chunk is flushed to disk or S3), and finally write the correct central directory for all files manually, at the very end.
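For background on what the library does on each append, the following sketch relies on the undocumented start_dir attribute of CPython's ZipFile: in append mode, zipfile seeks back to the start of the existing central directory and overwrites it with the new entry before writing a fresh central directory. That in-place rewrite is exactly what immutable, already-stored chunks forbid.

import zipfile

with zipfile.ZipFile('demo.zip', 'w') as zf:
    zf.writestr('a.txt', b'first')

with zipfile.ZipFile('demo.zip', 'a') as zf:
    # offset of the old central directory; the next local header
    # is written right here, clobbering the old CD
    print(zf.start_dir)
    zf.writestr('b.txt', b'second')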

Is this possible? If so, how can it be done?

Failing that, is there at least a way to diff tmp.zip and fixed.zip in a human-readable binary form, so that I can check where the CD is stored and what its format is?
Any accurate references on the zip format that would help solve this are also welcome.
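On the binary-inspection question: the authoritative format reference is PKWARE's APPNOTE.TXT (the Wikipedia "ZIP (file format)" article summarizes the same layout), and the zipdetails utility that ships with Perl's IO-Compress prints zip structures in human-readable form. As a sketch, the EOCD record can also be parsed by hand; this assumes no zip64 extensions and an empty archive comment, so the EOCD is exactly the last 22 bytes of the file:

import struct

def read_eocd(path):
    with open(path, 'rb') as fh:
        fh.seek(-22, 2)          # fixed EOCD size when the comment is empty
        eocd = fh.read(22)
    assert eocd[:4] == b'PK\x05\x06', 'EOCD signature not found'
    (_sig, disk_no, cd_disk, n_disk, n_total,
     cd_size, cd_offset, comment_len) = struct.unpack('<4s4HIIH', eocd)
    return n_total, cd_size, cd_offset

# compare where the central directory sits in the broken and fixed archives
print(read_eocd('tmp.zip'))
print(read_eocd('fixed.zip'))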

Well, in the end I created this Frankenstein of a zip. The idea: compress each file into its own single-entry in-memory zip, patch its local header offset to the running global offset, suppress the per-chunk central directory entirely, and write a single correct central directory and end record manually after the last chunk:

from io import BytesIO
from zipfile import ZipFile, ZIP_DEFLATED
import sys
import gc

files = (
    'input/i_1.docx',  # one file size is about ~580KB
    'input/i_2.docx',
    'input/i_3.docx',
    'input/i_4.docx',
    'input/i_5.docx',
    'input/i_6.docx',
    'input/i_7.docx',
    'input/i_8.docx',
    'input/i_9.docx',
    'input/i_10.docx',
    'input/i_11.docx',
    'input/i_12.docx',
    'input/i_13.docx',
    'input/i_14.docx',
    'input/i_21.docx'
)

# this function allows getting the size of an in-memory object
# added only for debug purposes
def _get_size(input_obj):
    memory_size = 0
    ids = set()
    objects = [input_obj]
    while objects:
        new = []
        for obj in objects:
            if id(obj) not in ids:
                ids.add(id(obj))
                memory_size += sys.getsizeof(obj)
                new.append(obj)
        objects = gc.get_referents(*new)
    return memory_size

class CustomizedZipFile(ZipFile):

    # customized BytesIO that can report a faked offset
    class _CustomizedBytesIO(BytesIO):
        def __init__(self, fake_offset: int):
            self.fake_offset = fake_offset
            self.temporary_switch_to_faked_offset = False
            super().__init__()

        def tell(self):
            if self.temporary_switch_to_faked_offset:
                # revert the tell method to normal mode to minimize the faked behaviour
                self.temporary_switch_to_faked_offset = False
                return super().tell() + self.fake_offset
            else:
                return super().tell()

    def __init__(self, *args, **kwargs):
        # create an empty file to write to if a fake offset is set
        if 'fake_offset' in kwargs and kwargs['fake_offset'] is not None and kwargs['fake_offset'] > 0:
            self._fake_offset = kwargs['fake_offset']
            del kwargs['fake_offset']
            if 'file' in kwargs:
                kwargs['file'] = self._CustomizedBytesIO(self._fake_offset)
            else:
                args = list(args)
                args[0] = self._CustomizedBytesIO(self._fake_offset)
        else:
            self._fake_offset = 0
        super().__init__(*args, **kwargs)

    # finalize the zip (should be run only on the last chunk)
    def force_write_end_record(self):
        self._write_end_record(False)

    # don't write the end record by default, so that we can get non-finalized chunks;
    # ZipFile writes the end meta information on close by default
    def _write_end_record(self, skip_write_end=True):
        if not skip_write_end:
            if self._fake_offset > 0:
                self.start_dir = self._fake_offset
                self.fp.temporary_switch_to_faked_offset = True
            super()._write_end_record()

def archive(files):
    compression_type = ZIP_DEFLATED
    CHUNK_SIZE = 1048576  # 1MB
    with open('tmp.zip', 'wb') as resulted_file:
        offset = 0
        filelist = []
        with BytesIO() as chunk:
            for f in files:
                with BytesIO() as tmp:
                    with CustomizedZipFile(tmp, 'w', compression=compression_type) as zf:
                        with open(f, 'rb') as b:
                            zf.writestr(
                                zinfo_or_arcname=f.replace('input/', 'output/'),
                                data=b.read()
                            )
                        # patch the local header offset to the entry's position in the final file
                        zf.filelist[0].header_offset = offset
                    data = tmp.getvalue()
                    offset = offset + len(data)
                    filelist.append(zf.filelist[0])
                    chunk.write(data)
                    print('size of zipfile:', _get_size(zf))
                    print('size of chunk:', _get_size(chunk))
                if len(chunk.getvalue()) > CHUNK_SIZE:
                    resulted_file.write(chunk.getvalue())
                    chunk.seek(0)
                    chunk.truncate()
            # write the last chunk
            resulted_file.write(chunk.getvalue())
        # the file parameter may be skipped if we are using fake_offset,
        # because an empty _CustomizedBytesIO is initialized in the constructor
        with CustomizedZipFile(None, 'w', compression=compression_type, fake_offset=offset) as zf:
            zf.filelist = filelist
            zf.force_write_end_record()
            end_data = zf.fp.getvalue()
            resulted_file.write(end_data)

archive(files)

The output is:

size of zipfile: 2182955
size of chunk: 582336
size of zipfile: 2182979
size of chunk: 1164533
size of zipfile: 2182983
size of chunk: 582342
size of zipfile: 2182979
size of chunk: 1164562
size of zipfile: 2182983
size of chunk: 582343
size of zipfile: 2182979
size of chunk: 1164568
size of zipfile: 2182983
size of chunk: 582337
size of zipfile: 2182983
size of chunk: 1164556
size of zipfile: 2182983
size of chunk: 582329
size of zipfile: 2182984
size of chunk: 1164543
size of zipfile: 2182984
size of chunk: 582355
size of zipfile: 2182984
size of chunk: 1164586
size of zipfile: 2182984
size of chunk: 582338
size of zipfile: 2182984
size of chunk: 1164545
size of zipfile: 2182980
size of chunk: 582320

So we can see that the chunk is always dumped to storage and truncated whenever the maximum chunk size (1MB in my case) is reached.

The resulting archive was tested with MacOS Unarchiver v4.2.4, the default Windows 10 unarchiver, and 7-zip.
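A programmatic check along the same lines (a sketch using the stock reader):

from zipfile import ZipFile

with ZipFile('tmp.zip') as zf:
    print(len(zf.namelist()))  # expect 15 entries
    print(zf.testzip())        # None means every CRC checks out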

Note

The file created by chunked archiving is 16 bytes larger than the one created by the plain zipfile library. Probably some extra zero bytes are written somewhere. I have not checked why this happens.
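One way to reproduce the measurement (a sketch; reference.zip is a hypothetical name): build the same archive with the stock zipfile API in one go and compare the file sizes.

import os
from zipfile import ZipFile, ZIP_DEFLATED

with ZipFile('reference.zip', 'w', compression=ZIP_DEFLATED) as zf:
    for f in files:
        zf.write(f, arcname=f.replace('input/', 'output/'))

# per the note above, the chunked tmp.zip comes out 16 bytes larger
print(os.path.getsize('tmp.zip') - os.path.getsize('reference.zip'))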

zipfile is the worst Python library I have ever seen. It looks like it is meant to be treated as a non-extensible binary.

Latest update