How to decompress partial, chunked gzip data in Python



I have already searched for similar solutions using gzip/zlib in Python:
The SO question How to inflate a partial zlib file does not work (see the first test case).
This is not a duplicate of Unzip part of a .gz file using python, which does not work (and is outdated anyway).
These two questions, Decompress a part of a file with the python gzip module and Is it possible to work out how to decompress a file, knowing its first bytes?, come close to this one (although they differ), but unfortunately the first has no working solution and the second has no answers at all...

I am iterating over chunks of gzip bytes received from a remote server, which looks like this:

async with aiohttp.ClientSession() as session:
    async with session.get(LINK) as response:
        with open(FILE, "wb") as f:
            async for chunk in response.content.iter_chunked(chunk_size):
                # Write the decompressed chunk
                # to `f`
                ...

Here are the solutions that did not work:
1)

decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
async with aiohttp.ClientSession() as session:
    async with session.get(LINK) as response:
        with open(FILE, "wb") as f:
            async for chunk in response.content.iter_chunked(chunk_size):
                # Write the decompressed chunk
                r = decompressor.decompress(chunk, chunk_size)
                # for some reason `r` is always empty
                # writing to `f` is pointless
                print(f"{len(chunk) = }, {r = }, {len(r) = }")

Here, r always appears to be empty.
stdout:

len(chunk) = 64, r = b'', len(r) = 0
len(chunk) = 64, r = b'', len(r) = 0
len(chunk) = 64, r = b'', len(r) = 0
len(chunk) = 64, r = b'', len(r) = 0
len(chunk) = 64, r = b'', len(r) = 0
len(chunk) = 64, r = b'', len(r) = 0
...

2)
Calling zlib.decompress(...) on the partial data does not seem to work either:

async with aiohttp.ClientSession() as session:
    async with session.get(LINK) as response:
        with open(DIR, "wb") as f:
            async for chunk in response.content.iter_chunked(chunk_size):
                f.write(zlib.decompress(chunk))

This raises:

Traceback (most recent call last):
  File "c:\Users\lumin\Desktop\rplace\get_data.py", line 54, in <module>
    asyncio.run(main())
  File "C:\Users\lumin\AppData\Local\Programs\Python\Python310\lib\asyncio\runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "C:\Users\lumin\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 641, in run_until_complete
    return future.result()
  File "c:\Users\lumin\Desktop\rplace\get_data.py", line 51, in main
    await download_content(0)
  File "c:\Users\lumin\Desktop\rplace\get_data.py", line 47, in download_content
    f.write(zlib.decompress(chunk))
zlib.error: Error -3 while decompressing data: incorrect header check

3)
Passing the chunk to gzip.decompress(chunk), like this:

with open(DIR, "wb") as f:
    async for chunk in response.content.iter_chunked(chunk_size):
        f.write(gzip.decompress(chunk))

causes:

Traceback (most recent call last):
  File "c:\Users\lumin\Desktop\rplace\get_data.py", line 54, in <module>
    asyncio.run(main())
  File "C:\Users\lumin\AppData\Local\Programs\Python\Python310\lib\asyncio\runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "C:\Users\lumin\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 641, in run_until_complete
    return future.result()
  File "c:\Users\lumin\Desktop\rplace\get_data.py", line 51, in main
    await download_content(0)
  File "c:\Users\lumin\Desktop\rplace\get_data.py", line 47, in download_content
    f.write(gzip.decompress(chunk))
  File "C:\Users\lumin\AppData\Local\Programs\Python\Python310\lib\gzip.py", line 557, in decompress
    return f.read()
  File "C:\Users\lumin\AppData\Local\Programs\Python\Python310\lib\gzip.py", line 301, in read
    return self._buffer.read(size)
  File "C:\Users\lumin\AppData\Local\Programs\Python\Python310\lib\_compression.py", line 118, in readall
    while data := self.read(sys.maxsize):
  File "C:\Users\lumin\AppData\Local\Programs\Python\Python310\lib\gzip.py", line 479, in read
    self._read_eof()
  File "C:\Users\lumin\AppData\Local\Programs\Python\Python310\lib\gzip.py", line 523, in _read_eof
    crc32, isize = struct.unpack("<II", self._read_exact(8))
  File "C:\Users\lumin\AppData\Local\Programs\Python\Python310\lib\gzip.py", line 425, in _read_exact
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached

The full code looks like this:

from typing import Final
import aiohttp
import asyncio
import os

if os.name == "nt":
    # Prevent noisy exit on Windows
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

async def download_content(
    number: int, *, directory: str | None = None, chunk_size: int = 64
) -> None:
    """
    Download the content of an archived canvas history file
    and extract it immediately.

    Args:
        number: The number associated with the archive.
        directory: The directory to extract the file to, defaults to root.
        chunk_size: The size of the chunks to download and extract, defaults to 64.

    Raises:
        TypeError: Argument got invalid type.
        ValueError: number wasn't between 0 and 77.
    """
    if not isinstance(number, int):
        raise TypeError(f"'number' must be of type 'int' got {type(number)}")
    if not isinstance(directory, str) and directory is not None:
        raise TypeError(f"'directory' must be of type 'str' got {type(directory)}")
    if not isinstance(chunk_size, int):
        raise TypeError(f"'chunk_size' must be of type 'int' got {type(chunk_size)}")
    if not 0 <= number <= 77:
        raise ValueError(f"'number' must be between 0 and 77 got {number}")
    LINK: Final[str] = "https://placedata.reddit.com/data/canvas-history/2022_place_canvas_history-"
    FILE_LOCATION: Final[str] = f"{'0' * (12 - len(str(number)))}{number}.csv.gzip"
    DIR: Final[str] = directory if directory is not None else "./"
    async with aiohttp.ClientSession() as session:
        async with session.get(LINK + FILE_LOCATION) as response:
            with open(DIR + FILE_LOCATION[:-5], "wb") as f:
                async for chunk in response.content.iter_chunked(chunk_size):
                    # Write the decompressed chunk to the file
                    ...

async def main():
    await download_content(0)

asyncio.run(main())

TLDR: We receive a gzip file, we are iterating over its chunks, and we want to decompress this partial data and write it to a file.

I don't think the second argument of decompress() means what you think it means. It is not the length of the input (the byte array itself already provides that), but rather a limit on the length of the decompressed data returned. You shouldn't even specify it, allowing decompress() to return all of the decompressed data so far.
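As a quick, standalone sketch of that behaviour (hypothetical data, not the question's code): capping the output with the second argument makes the object hold any unprocessed input in its unconsumed_tail attribute, whereas omitting it returns everything decompressed so far.

import gzip
import zlib

# Hypothetical example: compress some data, then feed it to a decompressobj
# with and without the max_length argument.
data = gzip.compress(b"x" * 1000)

d = zlib.decompressobj(16 + zlib.MAX_WBITS)
capped = d.decompress(data, 64)             # at most 64 *decompressed* bytes returned
print(len(capped), len(d.unconsumed_tail))  # unprocessed input is kept for later calls

d = zlib.decompressobj(16 + zlib.MAX_WBITS)
print(len(d.decompress(data)))              # no cap: all 1000 decompressed bytes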

The code below works for me. I used split -b 64 to split a gzip file into 64-byte chunks xaa, xab, and so on, then ran the script below with the argument x?? to feed it those chunks in order. The combined decompressed result was written correctly to stdout.

#!/usr/bin/python3
import sys
import zlib
gz = zlib.decompressobj(31)
for arg in sys.argv[1:]:
    with open(arg, "rb") as f:
        chunk = f.read()
    sys.stdout.buffer.write(gz.decompress(chunk))
sys.stdout.buffer.write(gz.flush())

(The flush at the end is not actually needed, since the last decompress() will have returned all of the decompressed data from the last chunk. I include it for completeness, to effectively close out the decompression object and release any resources it has sequestered.)
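Applied back to the question's aiohttp loop, a minimal sketch (untested, reusing the LINK, FILE and chunk_size names from above) would therefore be:

decompressor = zlib.decompressobj(31)  # 31 == 16 + zlib.MAX_WBITS, i.e. expect a gzip wrapper
async with aiohttp.ClientSession() as session:
    async with session.get(LINK) as response:
        with open(FILE, "wb") as f:
            async for chunk in response.content.iter_chunked(chunk_size):
                # No max_length argument: write whatever has decompressed so far
                f.write(decompressor.decompress(chunk))
            # Flush once the stream is exhausted, for completeness
            f.write(decompressor.flush())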
