Opening and reading "large" gzip-compressed files in C



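I have been trying to open and read a gzip-compressed file in C using the gzip-based file I/O functions. The file is large: about 12 GB compressed and roughly 260 GB uncompressed, so I am not prepared to decompress it with gunzip first and work from there.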

Specifically, I am using the following code to read and write through a fixed-size buffer:

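#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

#define windowBits 15
#define ENABLE_ZLIB_GZIP 32
#define CHUNK 0x4000
/* Run a zlib call and abort with a diagnostic if it returns a negative status. */
#define CALL_ZLIB(x) {                                                  \
    int status;                                                         \
    status = x;                                                         \
    if (status < 0)                                                     \
    {                                                                   \
        fprintf(stderr, "%s:%d: %s returned a bad status of %d.\n",     \
                __FILE__, __LINE__, #x, status);                        \
        exit(EXIT_FAILURE);                                             \
    }                                                                   \
}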

int main ()
{
    const char * file_name = "test.gz";
    FILE * file;
    z_stream strm = {0};
    unsigned char in[CHUNK];
    unsigned char out[CHUNK];
    strm.zalloc = Z_NULL;
    strm.zfree = Z_NULL;
    strm.opaque = Z_NULL;
    strm.next_in = in;
    strm.avail_in = 0;
    CALL_ZLIB (inflateInit2 (& strm, windowBits | ENABLE_ZLIB_GZIP));
    /* Open the file. */
    file = fopen (file_name, "rb");
    while (1) {
        int bytes_read;
        bytes_read = fread (in, sizeof (char), sizeof (in), file);
        strm.avail_in = bytes_read;
        do {
            unsigned have;
            strm.avail_out = CHUNK;
            strm.next_out = out;
            CALL_ZLIB (inflate (& strm, Z_NO_FLUSH));
            have = CHUNK - strm.avail_out;
            fwrite (out, sizeof (unsigned char), have, stdout);
        }
        while (strm.avail_out == 0);
        if (feof (file)) {
            inflateEnd (& strm);
            break;
        }
    }
    return 0;
}

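The code reads and writes the zlib file correctly, one buffer at a time, using the buffer size you specify up front. That size is fixed at some value (0x4000 in the case above).

The problem now is that I cannot increase the buffer size beyond a certain value (I can use 3276008 as the buffer size, but not 32760008). To read the 12 GB of compressed data, I thought I would need a very large buffer. As described in my edits below, this looks like some kind of DATA_ERROR rather than a BUFFER error, so it is not a buffer error after all!

Is there any way to read the entire 12 GB compressed file using the zlib functions above?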

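Edit #1

The error code returned by inflate is reported by the CALL_ZLIB macro, which I regrettably left out at first. When I run with a buffer size of 0x4000, I get the error below. I have also added the CALL_ZLIB macro to the code for your reference.

Error message: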

parser.c:96: inflate(&strm, Z_NO_FLUSH) returned a bad status of -3.

This clearly looks like a Z_DATA_ERROR.

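Edit #2

I tried passing a negative windowBits value to inflateInit2(), but that did not solve anything. The inflate() function initially reads my file correctly and shows all my data the way I want: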

0x55b0 [0x40]: event: 3
.
. ... raw event: size 64 bytes
.  0000:  03 00 00 00 00 00 40 00 18 03 00 00 18 03 00 00  ......@.........
.  0010:  4d 6f 64 65 6d 4d 61 6e 61 67 65 72 00 00 00 00  ModemManager....
.  0020:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
.  0030:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0 0 0x55b0 [0x40]: PERF_RECORD_COMM: ModemManager:792/792
0x55f0 [0x40]: event: 7
.
. ... raw event: size 64 bytes
.  0000:  07 00 00 00 00 00 40 00 19 03 00 00 01 00 00 00  ......@.........
.  0010:  19 03 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
.  0020:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
.  0030:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0 0 0x55f0 [0x40]: PERF_RECORD_FORK(793:793):(1:1)
0x5630 [0x40]: event: 3
.

But after a while, the output becomes garbled and I can no longer read anything from it.

0x4d68 [0x38]: ...........  001  0..
0 0 00 00 00 0 00 000 00 ze 64s
.  0000:  07 00 00 00 00 00 40 00 19 03 00 00 01 00 00 00  .. 00 0 event: size 64 bytes
.  0000:  03 00 00 00  si sisizsiz4s
.  0000:  07 00 00 00 00 00 40 00 19 0....
.  0030:  00 00 00 00 00 00 00 00 00 00 00 00 ..@.@.  0010:  19 03 00 00 [0x38]: ...........  001  0..
0 0 00 00 00 0 00 000 00 ze 64s
.  0000:  07 00 00 00 00 00 40 00 100 00 00 00 00  ..............0 0 0x4d28 [0x40]: PERF_RECORD_FORK(135:135):(2:62)
0x4d68 [0x38]: ...........  001  0..
0 0 00 00 00 0 00 000 00 00 00 00: PERORD_FORK(135:135):(2:2)

This eventually ends with the error message I described in Edit #1.

I solved the problem.

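The underlying problem was that I was not resetting the strm.next_in member of the z_stream inside the loop. inflate() advances next_in as it consumes input, so after the first iteration it no longer pointed at the start of the freshly refilled buffer. The input was then read from the wrong place, and I got the errors above.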

I changed the code to:

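  strm.next_in = in;
  strm.avail_in = 0;
  CALL_ZLIB(inflateInit2 (&strm, windowBits | ENABLE_ZLIB_GZIP));
  file = fopen(file_name, "rb");
  while(1)
  {
    int bytes_read;
    strm.next_in = in;     // added this line: inflate() advances next_in, so reset it for each refill
    bytes_read = fread(in, sizeof(char), sizeof(in), file);
    strm.avail_in = bytes_read;
    do
    {
      unsigned have;
      strm.avail_out = CHUNK;
      strm.next_out  = out;
      /* the rest of the loop is assumed unchanged from the question */
      CALL_ZLIB(inflate (&strm, Z_NO_FLUSH));
      have = CHUNK - strm.avail_out;
      fwrite(out, sizeof(unsigned char), have, stdout);
    }
    while (strm.avail_out == 0);
    if (feof(file))
    {
      inflateEnd(&strm);
      break;
    }
  }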
