PDF compression with iTextSharp



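I'm currently trying to recompress a PDF that has already been created. I'm looking for a way to recompress the images in the document in order to reduce the file size.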

I've been trying to do this with the DataLogics PDE and iTextSharp libraries, but I can't find a way to recompress the streams of the items.

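I have considered looping over the xobjects, getting the images, and then either dropping the DPI down to 96 or using the libjpeg C# implementation to change the quality of the image. However, getting the result back into the PDF stream always seems to end in memory corruption or some other problem.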

Any samples would be appreciated.

Thanks

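iText and iTextSharp have some methods for replacing indirect objects. Specifically, there's PdfReader.KillIndirect(), which does what it says, and PdfWriter.AddDirectImageSimple(iTextSharp.text.Image, PRIndirectReference), which you can then use to replace what you killed off.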

In pseudo C# code you'd do:

var oldImage = PdfReader.GetPdfObject();
var newImage = YourImageCompressionFunction(oldImage);
PdfReader.KillIndirect(oldImage);
yourPdfWriter.AddDirectImageSimple(newImage, (PRIndirectReference)oldImage);

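Converting the raw bytes into a .Net image can be tricky. I'll leave that up to you, or you can search this site. Mark has a good description here. Also, technically a PDF has no concept of DPI; that's mostly for printers. See the answer here for more information on that.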

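Using the method above, your compression algorithm can actually do two things: physically shrink the image and apply JPEG compression. When you physically shrink the image and add it back, it will occupy the same amount of space on the page as the original image but with fewer pixels to work with. This gets you what you think of as a DPI reduction. The JPEG compression speaks for itself.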
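To make the "DPI reduction" idea concrete, here is a minimal sketch of the arithmetic, assuming the default PDF user space of 72 units per inch. The class and method names (ImageDpi, EffectiveDpi, ScaleForTargetDpi) are hypothetical helpers for illustration only and are not part of iTextSharp.

```csharp
using System;

// Hypothetical helper for illustration; not part of iTextSharp.
public static class ImageDpi
{
    // Assumption: default PDF user space, 72 units (points) per inch.
    private const double PointsPerInch = 72.0;

    // Effective resolution of an image that has pixelWidth pixels
    // and is drawn displayWidthPoints wide on the page.
    public static double EffectiveDpi(int pixelWidth, double displayWidthPoints)
    {
        if (pixelWidth <= 0) throw new ArgumentOutOfRangeException(nameof(pixelWidth));
        if (displayWidthPoints <= 0) throw new ArgumentOutOfRangeException(nameof(displayWidthPoints));
        return pixelWidth / (displayWidthPoints / PointsPerInch);
    }

    // Scale factor to hand to an image-shrinking routine so that the image
    // ends up at roughly targetDpi. Capped at 1.0 so images are never upscaled.
    public static double ScaleForTargetDpi(int pixelWidth, double displayWidthPoints, double targetDpi)
    {
        if (targetDpi <= 0) throw new ArgumentOutOfRangeException(nameof(targetDpi));
        double current = EffectiveDpi(pixelWidth, displayWidthPoints);
        return Math.Min(1.0, targetDpi / current);
    }
}
```

For example, an image 3000 pixels wide that is drawn 720 points (10 inches) wide has an effective resolution of 300 DPI, so ImageDpi.ScaleForTargetDpi(3000, 720, 96) gives a scale factor of about 0.32.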

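Below is a full working C# 2010 WinForms app targeting iTextSharp 5.1.1.0. It takes an existing JPEG on your desktop called "LargeImage.jpg" and creates a new PDF from it. It then opens the PDF, extracts the image, physically shrinks it to 90% of its original size, applies 85% JPEG compression, and writes it back to the PDF. See the comments in the code for more explanation. The code needs a lot more null/error checking. Also look for the NOTE comments, which mark places you'll need to expand to handle other cases.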

using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.Drawing.Drawing2D;
using System.Windows.Forms;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;
namespace WindowsFormsApplication1 {
    public partial class Form1 : Form {
        public Form1() {
            InitializeComponent();
        }
        private void Form1_Load(object sender, EventArgs e) {
            //Our working folder
            string workingFolder = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);
            //Large image to add to sample PDF
            string largeImage = Path.Combine(workingFolder, "LargeImage.jpg");
            //Name of large PDF to create
            string largePDF = Path.Combine(workingFolder, "Large.pdf");
            //Name of compressed PDF to create
            string smallPDF = Path.Combine(workingFolder, "Small.pdf");
            //Create a sample PDF containing our large image, for demo purposes only, nothing special here
            using (FileStream fs = new FileStream(largePDF, FileMode.Create, FileAccess.Write, FileShare.None)) {
                using (Document doc = new Document()) {
                    using (PdfWriter writer = PdfWriter.GetInstance(doc, fs)) {
                        doc.Open();
                        iTextSharp.text.Image importImage = iTextSharp.text.Image.GetInstance(largeImage);
                        doc.SetPageSize(new iTextSharp.text.Rectangle(0, 0, importImage.Width, importImage.Height));
                        doc.SetMargins(0, 0, 0, 0);
                        doc.NewPage();
                        doc.Add(importImage);
                        doc.Close();
                    }
                }
            }
            //Now we're going to open the above PDF and compress things
            //Bind a reader to our large PDF
            PdfReader reader = new PdfReader(largePDF);
            //Create our output PDF
            using (FileStream fs = new FileStream(smallPDF, FileMode.Create, FileAccess.Write, FileShare.None)) {
                //Bind a stamper to the file and our reader
                using (PdfStamper stamper = new PdfStamper(reader, fs)) {
                    //NOTE: This code only deals with page 1, you'd want to loop more for your code
                    //Get page 1
                    PdfDictionary page = reader.GetPageN(1);
                    //Get the xobject structure
                    PdfDictionary resources = (PdfDictionary)PdfReader.GetPdfObject(page.Get(PdfName.RESOURCES));
                    PdfDictionary xobject = (PdfDictionary)PdfReader.GetPdfObject(resources.Get(PdfName.XOBJECT));
                    if (xobject != null) {
                        PdfObject obj;
                        //Loop through each key
                        foreach (PdfName name in xobject.Keys) {
                            obj = xobject.Get(name);
                            if (obj.IsIndirect()) {
                                //Get the current key as a PDF object
                                PdfDictionary imgObject = (PdfDictionary)PdfReader.GetPdfObject(obj);
                                //See if it's an image (constant-first comparison also copes with a missing /Subtype)
                                if (PdfName.IMAGE.Equals(imgObject.Get(PdfName.SUBTYPE))) {
                                    //NOTE: There's a bunch of different types of filters, I'm only handling the simplest one here which is basically raw JPG, you'll have to research others
                                    if (PdfName.DCTDECODE.Equals(imgObject.Get(PdfName.FILTER))) {
                                        //Get the raw bytes of the current image
                                        byte[] oldBytes = PdfReader.GetStreamBytesRaw((PRStream)imgObject);
                                        //Will hold bytes of the compressed image later
                                        byte[] newBytes;
                                        //Wrap a stream around our original image
                                        using (MemoryStream sourceMS = new MemoryStream(oldBytes)) {
                                            //Convert the bytes into a .Net image
                                            using (System.Drawing.Image oldImage = Bitmap.FromStream(sourceMS)) {
                                                //Shrink the image to 90% of the original
                                                using (System.Drawing.Image newImage = ShrinkImage(oldImage, 0.9f)) {
                                                    //Convert the image to bytes using JPG at 85%
                                                    newBytes = ConvertImageToBytes(newImage, 85);
                                                }
                                            }
                                        }
                                        //Create a new iTextSharp image from our bytes
                                        iTextSharp.text.Image compressedImage = iTextSharp.text.Image.GetInstance(newBytes);
                                        //Kill off the old image
                                        PdfReader.KillIndirect(obj);
                                        //Add our image in its place
                                        stamper.Writer.AddDirectImageSimple(compressedImage, (PRIndirectReference)obj);
                                    }
                                }
                            }
                        }
                    }
                }
            }
            //Release the reader's resources now that the stamper has been closed
            reader.Close();
            this.Close();
        }
        //Standard image save code from MSDN, returns a byte array
        private static byte[] ConvertImageToBytes(System.Drawing.Image image, long compressionLevel) {
            if (compressionLevel < 0) {
                compressionLevel = 0;
            } else if (compressionLevel > 100) {
                compressionLevel = 100;
            }
            ImageCodecInfo jpgEncoder = GetEncoder(ImageFormat.Jpeg);
            System.Drawing.Imaging.Encoder myEncoder = System.Drawing.Imaging.Encoder.Quality;
            EncoderParameters myEncoderParameters = new EncoderParameters(1);
            EncoderParameter myEncoderParameter = new EncoderParameter(myEncoder, compressionLevel);
            myEncoderParameters.Param[0] = myEncoderParameter;
            using (MemoryStream ms = new MemoryStream()) {
                image.Save(ms, jpgEncoder, myEncoderParameters);
                return ms.ToArray();
            }
        }
        //standard code from MSDN
        private static ImageCodecInfo GetEncoder(ImageFormat format) {
            //We want to save an image, so enumerate the encoders rather than the decoders
            ImageCodecInfo[] codecs = ImageCodecInfo.GetImageEncoders();
            foreach (ImageCodecInfo codec in codecs) {
                if (codec.FormatID == format.Guid) {
                    return codec;
                }
            }
            return null;
        }
        //Standard high quality thumbnail generation from http://weblogs.asp.net/gunnarpeipman/archive/2009/04/02/resizing-images-without-loss-of-quality.aspx
        private static System.Drawing.Image ShrinkImage(System.Drawing.Image sourceImage, float scaleFactor) {
            int newWidth = Convert.ToInt32(sourceImage.Width * scaleFactor);
            int newHeight = Convert.ToInt32(sourceImage.Height * scaleFactor);
            var thumbnailBitmap = new Bitmap(newWidth, newHeight);
            using (Graphics g = Graphics.FromImage(thumbnailBitmap)) {
                g.CompositingQuality = CompositingQuality.HighQuality;
                g.SmoothingMode = SmoothingMode.HighQuality;
                g.InterpolationMode = InterpolationMode.HighQualityBicubic;
                System.Drawing.Rectangle imageRectangle = new System.Drawing.Rectangle(0, 0, newWidth, newHeight);
                g.DrawImage(sourceImage, imageRectangle);
            }
            return thumbnailBitmap;
        }
    }
}

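I don't know iTextSharp, but you have to rewrite a PDF file if anything changes, because it contains an xref table (an index) with the exact file position of each object. This means that if even one byte is added or removed, the PDF becomes corrupted.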

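Your best bet for recompressing the images is JBIG2 if they are B&W, or JPEG2000 otherwise. The Jasper library will happily encode JPEG2000 codestreams for placement into PDF files at whatever quality you desire.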

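If it were me, I'd do it all from code without a PDF library. Just find all the images (anything between stream and endstream after an occurrence of JPXDecode (for JPEG2000), JBIG2Decode (for JBIG2) or DCTDecode (for JPEG)), pull them out, re-encode them with Jasper, then stick them back in again and update the xref table.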

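To update the xref table, find the position of each object (each one starts with something like 00001 0 obj) and write the new positions into the xref table. It's not too much work, less than it sounds. You might be able to get all the offsets with a single regular expression (I'm not a C# programmer, but in PHP it would be that simple).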

Then finally update the value of the startxref tag in the trailer with the offset of the beginning of the xref table (where it says xref in the file).
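The regular-expression idea above might look roughly like the following in C#. This is only a sketch under stated assumptions: the file uses a classic xref table, every object header starts at the beginning of a line, and no binary stream data happens to contain text that looks like an object header. The names XrefSketch, FindObjectOffsets and FormatInUseEntry are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any PDF library.
public static class XrefSketch
{
    // Latin-1 maps every byte to exactly one char, so a match index in the
    // decoded string is also the byte offset in the file.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // Matches an object header such as "12 0 obj" at the start of the file
    // or at the start of a line.
    private static readonly Regex ObjHeader =
        new Regex(@"(?<=^|[\r\n])(\d+)[ ]+(\d+)[ ]+obj\b", RegexOptions.Compiled);

    // Returns object number -> byte offset of its "N G obj" header.
    // If an object number appears more than once, the last occurrence wins.
    public static SortedDictionary<int, long> FindObjectOffsets(byte[] pdf)
    {
        string text = Latin1.GetString(pdf);
        var offsets = new SortedDictionary<int, long>();
        foreach (Match m in ObjHeader.Matches(text))
        {
            int objectNumber = int.Parse(m.Groups[1].Value);
            offsets[objectNumber] = m.Index;
        }
        return offsets;
    }

    // Formats one in-use entry of a classic xref table. Each entry is exactly
    // 20 bytes: 10-digit offset, space, 5-digit generation, space, 'n', CR LF.
    public static string FormatInUseEntry(long offset, int generation)
    {
        return offset.ToString("D10") + " " + generation.ToString("D5") + " n\r\n";
    }
}
```

After re-inserting the recompressed images you would call FindObjectOffsets on the new bytes, rewrite each xref entry with FormatInUseEntry, and then write the new position of the xref keyword after startxref. The /Length entry of every stream you changed has to be updated as well.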

Otherwise you'll end up decoding the entire PDF and rewriting it, which will be slow, and you may lose something along the way.

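The creator of iText has published an example of how to find and replace images in an existing PDF. It's actually a small excerpt from his book. Since it's in Java, here's a simple replacement: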

public void ReduceResolution(PdfReader reader, long quality) {
  int n = reader.XrefSize;
  for (int i = 0; i < n; i++) {
    PdfObject obj = reader.GetPdfObject(i);
    if (obj == null || !obj.IsStream()) {continue;}
    PdfDictionary dict = (PdfDictionary)PdfReader.GetPdfObject(obj);
    PdfName subType = (PdfName)PdfReader.GetPdfObject(
      dict.Get(PdfName.SUBTYPE)
    );
    if (!PdfName.IMAGE.Equals(subType)) {continue;}
    PRStream stream = (PRStream )obj;
    try {
      PdfImageObject image = new PdfImageObject(stream);
      PdfName filter = (PdfName) image.Get(PdfName.FILTER);
      if (
        PdfName.JBIG2DECODE.Equals(filter)
        || PdfName.JPXDECODE.Equals(filter)
        || PdfName.CCITTFAXDECODE.Equals(filter)
        || PdfName.FLATEDECODE.Equals(filter)
      ) continue;
      System.Drawing.Image img = image.GetDrawingImage();
      if (img == null) continue;
      var ll = image.GetImageBytesType();
      int width = img.Width;
      int height = img.Height;
      using (System.Drawing.Bitmap dotnetImg =
         new System.Drawing.Bitmap(img))
      {
        // set codec to jpeg type => jpeg index codec is "1"
        System.Drawing.Imaging.ImageCodecInfo codec =
        System.Drawing.Imaging.ImageCodecInfo.GetImageEncoders()[1];
        // set parameters for image quality
        System.Drawing.Imaging.EncoderParameters eParams =
         new System.Drawing.Imaging.EncoderParameters(1);
        eParams.Param[0] =
         new System.Drawing.Imaging.EncoderParameter(
           System.Drawing.Imaging.Encoder.Quality, quality
        );
        using (MemoryStream msImg = new MemoryStream()) {
          dotnetImg.Save(msImg, codec, eParams);
          // Store the JPEG bytes as-is (compress = false); they are already
          // compressed, so the compression level argument is not applied
          stream.SetData(
           msImg.ToArray(), false, PRStream.BEST_COMPRESSION
          );
          stream.Put(PdfName.TYPE, PdfName.XOBJECT);
          stream.Put(PdfName.SUBTYPE, PdfName.IMAGE);
          stream.Put(PdfName.FILTER, PdfName.DCTDECODE);
          stream.Put(PdfName.WIDTH, new PdfNumber(width));
          stream.Put(PdfName.HEIGHT, new PdfNumber(height));
          stream.Put(PdfName.BITSPERCOMPONENT, new PdfNumber(8));
          stream.Put(PdfName.COLORSPACE, PdfName.DEVICERGB);
        }
      }
    }
    catch {
      // throw;
      // iText[Sharp] can't handle all image types...
    }
    finally {
// may or may not help      
      reader.RemoveUnusedObjects();
    }
  }
}

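You'll notice that it only processes JPEG. The logic is reversed (instead of explicitly handling only DCTDECODE/JPEG) so that you can comment out some of the ignored image types and experiment with PdfImageObject in the code above. In particular, most FLATEDECODE images (.bmp, .png, and .gif) are represented as PNG (confirmed in the DecodeImageBytes method of the PdfImageObject source code). AFAIK, .NET does not support PNG encoding. There are some references to support this here and here. You can try a stand-alone PNG optimization executable, but you'll also have to figure out how to set PdfName.BITSPERCOMPONENT and PdfName.COLORSPACE in the PRStream.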

For completeness, since your question specifically asks about PDF compression, here's how you can compress a PDF with iTextSharp:

PdfStamper stamper = new PdfStamper(
  reader, YOUR-STREAM, PdfWriter.VERSION_1_5
);
stamper.Writer.CompressionLevel = 9;
int total = reader.NumberOfPages + 1;
for (int i = 1; i < total; i++) {
  reader.SetPageContent(i, reader.GetPageContent(i));
}
stamper.SetFullCompression();
stamper.Close();

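You might also try running the PDF through PdfSmartCopy to reduce the file size. It removes redundant resources, but like the call to RemoveUnusedObjects() in the finally block, it may or may not help. That depends on how the PDF was created.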

IIRC iText[Sharp] doesn't deal well with JBIG2DECODE, so @Alasdair's suggestion looks good, if you want to take the time to learn the Jasper library and use the brute-force approach.

Good luck.

EDIT - 2012-08-17, in response to @Craig's comment:

To save the PDF after compressing the JPEGs with the ReduceResolution() method above:

a. Instantiate a PdfReader object:

PdfReader reader = new PdfReader(pdf);

b. Pass the PdfReader to the ReduceResolution() method above.

c. Pass the altered PdfReader to a PdfStamper. Here's one way, using a MemoryStream:

// Save the altered PDF; you can then pass the byte array to a database, etc.
using (MemoryStream ms = new MemoryStream()) {
  using (PdfStamper stamper = new PdfStamper(reader, ms)) {
  }
  return ms.ToArray();
}

You can also use any other Stream if you don't need to keep the PDF in memory. For example, use a FileStream and save directly to disk.

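I've written a library to do just that. It will also OCR the PDFs using Tesseract or Cuneiform and create searchable, compressed PDF files. It uses several open source projects (iTextSharp, jbig2 encoder, AForge, muPDF#) to complete the task. You can check it out here: http://hocrtopdf.codeplex.com/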

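I'm not sure whether you're considering other libraries, but you can easily recompress existing images using the Docotic.Pdf library (disclaimer: I work for the company).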

Here is some sample code:

static void RecompressExistingImages(string fileName, string outputName)
{
    using (PdfDocument doc = new PdfDocument(fileName))
    {
        foreach (PdfImage image in doc.Images)
            image.RecompressWithGroup4Fax();
        doc.Save(outputName);
    }
}

There are also RecompressWithFlate, RecompressWithGroup3Fax, RecompressWithJpeg, and Uncompress methods.

The library will convert color images to bilevel ones if needed. You can specify the deflate compression level, the JPEG quality, and so on.

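I'd also ask you to think twice before using the approach suggested by @Alasdair. If you're going to deal with PDF files that you did not create yourself, the task can be far more complex than it seems.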

To start with, there are a great many images compressed with codecs other than JPXDecode, JBIG2Decode, and DCTDecode. A PDF can also contain inline images.

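PDF files saved using newer versions of the standard (1.5 or later) can contain cross-reference streams. That means reading and updating such files is more complex than just finding and updating a few numbers at the end of the file.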
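If you want to see which kind of file you are dealing with before attempting any manual patching, a rough check is to follow the last startxref offset and look at what it points to. The sketch below assumes the startxref keyword sits within the last 1024 bytes of the file; the class and method names (XrefKind, UsesClassicXrefTable) are invented for this example, and hybrid files that contain both a table and a stream are not handled.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any PDF library.
public static class XrefKind
{
    // Latin-1 keeps a one-to-one mapping between bytes and chars.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // Returns true when the last startxref offset points at the "xref"
    // keyword of a classic table, and false when it points at something else
    // (typically the "N G obj" header of a cross-reference stream).
    public static bool UsesClassicXrefTable(byte[] pdf)
    {
        // Assumption: startxref is within the last 1024 bytes of the file.
        int tailLength = Math.Min(pdf.Length, 1024);
        string tail = Latin1.GetString(pdf, pdf.Length - tailLength, tailLength);

        int index = tail.LastIndexOf("startxref", StringComparison.Ordinal);
        if (index < 0) throw new FormatException("No startxref keyword found.");

        Match m = Regex.Match(tail.Substring(index), @"^startxref\s+(\d+)");
        if (!m.Success) throw new FormatException("startxref is not followed by an offset.");

        long offset = long.Parse(m.Groups[1].Value);
        if (offset + 4 > pdf.Length) throw new FormatException("startxref offset is out of range.");

        return Latin1.GetString(pdf, (int)offset, 4) == "xref";
    }
}
```

A file for which this returns false cannot be fixed up by rewriting 20-byte table entries, which is one practical reason to prefer a PDF library.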

So, please, use a PDF library.

A simple way to compress a PDF is to use gsdll32.dll (Ghostscript) together with Cyotek.Ghostscript.dll (a wrapper):

public static void CompressPDF(string sInFile, string sOutFile, int iResolution)
    {
        string[] arg = new string[]
        {
            "-sDEVICE=pdfwrite",
            "-dNOPAUSE",
            "-dSAFER",
            "-dBATCH",
            "-dCompatibilityLevel=1.5",
            "-dDownsampleColorImages=true",
            "-dDownsampleGrayImages=true",
            "-dDownsampleMonoImages=true",
            "-sPAPERSIZE=a4",
            "-dPDFFitPage",
            "-dDOINTERPOLATE",
            "-dColorImageDownsampleThreshold=1.0",
            "-dGrayImageDownsampleThreshold=1.0",
            "-dMonoImageDownsampleThreshold=1.0",
            "-dColorImageResolution=" + iResolution.ToString(),
            "-dGrayImageResolution=" + iResolution.ToString(),
            "-dMonoImageResolution=" + iResolution.ToString(),
            "-sOutputFile=" + sOutFile,
            sInFile
        };
        using(GhostScriptAPI api = new GhostScriptAPI())
        {
            api.Execute(arg);
        }
    }
