Why is GZipped protobuf-net larger than a GZipped tab-separated-values file?



We recently compared the file sizes of the same tabular data (picture a single table, six columns, describing a product catalog) serialized either with protobuf-net or as TSV (tab-separated values), with both files compressed using GZip (the default .NET implementation).

I was surprised to notice that the compressed protobuf-net version takes up considerably more space than the text version (up to 3x).

My favorite theory so far is that ProtoBuf does not respect byte semantics and therefore mismatches GZip's frequency-based compression trees; hence the compression is relatively inefficient.

Another possibility is that ProtoBuf actually encodes more data (e.g., to facilitate schema versioning), so the serialized formats are not strictly comparable in information terms.

Has anyone noticed the same issue? Is it even worth compressing ProtoBuf?

There could be many factors here; firstly, note that the protocol buffers wire format uses a direct UTF-8 encoding for strings; if your data is dominated by strings, it will end up needing roughly the same amount of space as the TSV.
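As a rough illustration of that point, here is a hand-rolled sketch (in Python, not protobuf-net itself, and restricted to single-byte tags and lengths for simplicity) of how a string field is laid out on the wire: a tag byte, a length, then the raw UTF-8 bytes. The payload bytes are identical to the text itself, so string-heavy data costs about the same as a text format plus a couple of bytes per field.

```python
def encode_string_field(field_number: int, text: str) -> bytes:
    """Sketch of protobuf wire encoding for a short string field."""
    payload = text.encode("utf-8")
    tag = (field_number << 3) | 2                    # wire type 2 = length-delimited
    assert field_number < 16 and len(payload) < 128  # keep tag and length to one byte each
    return bytes([tag, len(payload)]) + payload

encoded = encode_string_field(2, "Lorem ipsum dolor")
print(len(encoded) - len("Lorem ipsum dolor"))  # → 2 (one tag byte + one length byte)
```

The overhead per string field is fixed and tiny; the bulk of the bytes are the same UTF-8 text that TSV would store, which is why GZip has roughly the same material to work with in both formats.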

Protocol buffers are also designed to help with storing structured data, i.e. richer models than the single-table scenario. This does not contribute much to the size, but start comparing against xml/json etc. (which are more similar in terms of capabilities) and the difference is more obvious.

Additionally, because protocol buffers are pretty dense (UTF-8 notwithstanding), in some cases compressing them can actually make them bigger — you may want to check whether that is happening in your case.
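A quick way to see that effect (a sketch in Python; `os.urandom` stands in for any already-dense, incompressible payload):

```python
import gzip
import os

dense = os.urandom(100_000)           # incompressible bytes, like an already-dense payload
compressed = gzip.compress(dense)
print(len(compressed) > len(dense))   # → True: header and block overhead make it grow
```

When the input has no redundancy left to exploit, gzip's container overhead (header, trailer, block framing) is pure cost, so the "compressed" file is slightly larger.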

In a quick sample using the scenario you describe, both formats come out at roughly the same size — no big jump:

protobuf-net, no compression: 2498720 bytes, write 34ms, read 72ms, chk 50000
protobuf-net, gzip: 1521215 bytes, write 234ms, read 146ms, chk 50000
tsv, no compression: 2492591 bytes, write 74ms, read 122ms, chk 50000
tsv, gzip: 1258500 bytes, write 238ms, read 169ms, chk 50000

TSV is marginally smaller in this case, but TSV really is a very simple format (with very limited capabilities in terms of structured data), so it is no surprise that it is quick too.

Indeed; if all you are storing is a very simple single table, then TSV is a fine option — it is, however, ultimately a very limited format. I cannot reproduce your "much bigger" example.

In addition to richer support for structured data (and other features), protobuf also puts a strong emphasis on processing performance. Now, since TSV is so simple, the edge here will not be huge (but is noticeable above); but again: tested against formats with comparable capabilities, such as xml, json or the built-in BinaryFormatter, the difference is clear.


The code behind the numbers above (updated to use BufferedStream):

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;
using System.Text;
using ProtoBuf;
static class Program
{
    static void Main()
    {
        RunTest(12345, 1, new StringWriter()); // let everyone JIT etc
        RunTest(12345, 50000, Console.Out); // actual test
        Console.WriteLine("(done)");
        Console.ReadLine();
    }
    static void RunTest(int seed, int count, TextWriter cout)
    {
        var data = InventData(seed, count);
        byte[] raw;
        Catalog catalog;
        var write = Stopwatch.StartNew();
        using(var ms = new MemoryStream())
        {
            Serializer.Serialize(ms, data);
            raw = ms.ToArray();
        }
        write.Stop();
        var read = Stopwatch.StartNew();
        using(var ms = new MemoryStream(raw))
        {
            catalog = Serializer.Deserialize<Catalog>(ms);
        }
        read.Stop();
        cout.WriteLine("protobuf-net, no compression: {0} bytes, write {1}ms, read {2}ms, chk {3}", raw.Length, write.ElapsedMilliseconds, read.ElapsedMilliseconds, catalog.Products.Count);
        raw = null; catalog = null;
        write = Stopwatch.StartNew();
        using (var ms = new MemoryStream())   
        {
            using (var gzip = new GZipStream(ms, CompressionMode.Compress, true))
            using (var bs = new BufferedStream(gzip, 64 * 1024))
            {
                Serializer.Serialize(bs, data);
            } // need to close gzip to flush it (flush doesn't flush)
            raw = ms.ToArray();
        }
        write.Stop();
        read = Stopwatch.StartNew();
        using(var ms = new MemoryStream(raw))
        using(var gzip = new GZipStream(ms, CompressionMode.Decompress, true))
        {
            catalog = Serializer.Deserialize<Catalog>(gzip);
        }
        read.Stop();
        cout.WriteLine("protobuf-net, gzip: {0} bytes, write {1}ms, read {2}ms, chk {3}", raw.Length, write.ElapsedMilliseconds, read.ElapsedMilliseconds, catalog.Products.Count);
        raw = null; catalog = null;
        write = Stopwatch.StartNew();
        using (var ms = new MemoryStream())
        {
            using (var writer = new StreamWriter(ms))
            {
                WriteTsv(data, writer);
            }
            raw = ms.ToArray();
        }
        write.Stop();
        read = Stopwatch.StartNew();
        using (var ms = new MemoryStream(raw))
        using (var reader = new StreamReader(ms))
        {
            catalog = ReadTsv(reader);
        }
        read.Stop();
        cout.WriteLine("tsv, no compression: {0} bytes, write {1}ms, read {2}ms, chk {3}", raw.Length, write.ElapsedMilliseconds, read.ElapsedMilliseconds, catalog.Products.Count);
        raw = null; catalog = null;
        write = Stopwatch.StartNew();
        using (var ms = new MemoryStream())
        {
            using (var gzip = new GZipStream(ms, CompressionMode.Compress))
            using(var bs = new BufferedStream(gzip, 64 * 1024))
            using(var writer = new StreamWriter(bs))
            {
                WriteTsv(data, writer);
            }
            raw = ms.ToArray();
        }
        write.Stop();
        read = Stopwatch.StartNew();
        using(var ms = new MemoryStream(raw))
        using(var gzip = new GZipStream(ms, CompressionMode.Decompress, true))
        using(var reader = new StreamReader(gzip))
        {
            catalog = ReadTsv(reader);
        }
        read.Stop();
        cout.WriteLine("tsv, gzip: {0} bytes, write {1}ms, read {2}ms, chk {3}", raw.Length, write.ElapsedMilliseconds, read.ElapsedMilliseconds, catalog.Products.Count);
    }
    private static Catalog ReadTsv(StreamReader reader)
    {
        string line;
        List<Product> list = new List<Product>();
        while((line = reader.ReadLine()) != null)
        {
            string[] parts = line.Split('\t');
            var row = new Product();
            row.Id = int.Parse(parts[0]);
            row.Name = parts[1];
            row.QuantityAvailable = int.Parse(parts[2]);
            row.Price = decimal.Parse(parts[3]);
            row.Weight = int.Parse(parts[4]);
            row.Sku = parts[5];
            list.Add(row);
        }
        return new Catalog {Products = list};
    }
    private static void WriteTsv(Catalog catalog, StreamWriter writer)
    {
        foreach (var row in catalog.Products)
        {
            writer.Write(row.Id);
            writer.Write('\t');
            writer.Write(row.Name);
            writer.Write('\t');
            writer.Write(row.QuantityAvailable);
            writer.Write('\t');
            writer.Write(row.Price);
            writer.Write('\t');
            writer.Write(row.Weight);
            writer.Write('\t');
            writer.Write(row.Sku);
            writer.WriteLine();
        }
    }
    static Catalog InventData(int seed, int count)
    {
        string[] lipsum =
            @"Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
                .Split(' ');
        char[] skuChars = "0123456789abcdef".ToCharArray();
        Random rand = new Random(seed);
        var list = new List<Product>(count);
        int id = 0;
        for (int i = 0; i < count; i++)
        {
            var row = new Product();
            row.Id = id++;
            var name = new StringBuilder(lipsum[rand.Next(lipsum.Length)]);
            int wordCount = rand.Next(0,5);
            for (int j = 0; j < wordCount; j++)
            {
                name.Append(' ').Append(lipsum[rand.Next(lipsum.Length)]);
            }
            row.Name = name.ToString();
            row.QuantityAvailable = rand.Next(1000);
            row.Price = rand.Next(10000)/100M;
            row.Weight = rand.Next(100);
            char[] sku = new char[10];
            for(int j = 0 ; j < sku.Length ; j++)
                sku[j] = skuChars[rand.Next(skuChars.Length)];
            row.Sku = new string(sku);
            list.Add(row);
        }
        return new Catalog {Products = list};
    }
}
[ProtoContract]
public class Catalog
{
    [ProtoMember(1, DataFormat = DataFormat.Group)]
    public List<Product> Products { get; set; } 
}
[ProtoContract]
public class Product
{
    [ProtoMember(1)]
    public int Id { get; set; }
    [ProtoMember(2)]
    public string Name { get; set; }
    [ProtoMember(3)]
    public int QuantityAvailable { get; set;}
    [ProtoMember(4)]
    public decimal Price { get; set; }
    [ProtoMember(5)]
    public int Weight { get; set; }
    [ProtoMember(6)]
    public string Sku { get; set; }
}

GZip is a stream compressor. If you do not buffer the data properly, the compression will be very poor, because it will only operate on small chunks at a time, resulting in far less effective compression.

Try placing a BufferedStream between the serializer and the GZipStream, with a suitably large buffer.

Example: compressing the Int32 sequence 1.. by writing with a BinaryWriter directly to a GZipStream results in ~650kb of compressed data, whereas going through a 64kb BufferedStream yields only ~340kb.
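The underlying effect is easy to reproduce outside .NET as well. This Python sketch with zlib forces a flush after every tiny write — an exaggerated stand-in for what a stream of small, unbuffered writes does to the compressor (GZipStream does not literally full-flush per write, but the principle is the same) — versus compressing one buffered blob:

```python
import struct
import zlib

# A long sequence of 4-byte little-endian Int32 values, as in the example above.
values = [struct.pack("<i", i) for i in range(1, 100_001)]

# Many tiny writes, each flushed: the compressor never sees enough context.
c = zlib.compressobj()
chunked = b"".join(c.compress(v) + c.flush(zlib.Z_FULL_FLUSH) for v in values)
chunked += c.flush()

# One buffered write: the compressor works over the whole stream at once.
buffered = zlib.compress(b"".join(values))

print(len(chunked), len(buffered))  # the chunked output is several times larger
```

Each forced flush emits block framing and resets the compressor's context, so per-chunk overhead dominates; buffering lets deflate find matches across the whole stream instead.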
