iTextSharp returns only whitespace strings for a specific PDF file

I am testing this simple code:

using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System;
using System.Text;
using System.Windows.Forms;

namespace PDF_TXT2
{
    class Program
    {
        [STAThread] // clipboard access requires a single-threaded apartment
        static void Main(string[] args)
        {
            string path = args[0]; // path to the PDF, passed as the first argument
            PdfReader reader = new PdfReader(path);
            StringBuilder text = new StringBuilder();
            // iTextSharp page numbers are 1-based
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                text.Append(PdfTextExtractor.GetTextFromPage(reader, page));
            }
            reader.Close();
            Clipboard.SetText(text.ToString());
            MessageBox.Show(text.ToString());
        }
    }
}

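For this particular PDF file the result looks like an empty string. It is not actually empty, though; it consists entirely of whitespace.

Can you help me understand why?

Thank you very much!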

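The fonts in your PDF have this entry:

/ToUnicode/Identity-H

That is, the value of ToUnicode is the name Identity-H.

According to the PDF specification, however, the value of ToUnicode must be a stream!

ToUnicode: stream (Optional) A stream containing a CMap file that maps character codes to Unicode values (see 9.10, "Extraction of Text Content").

The ToUnicode mappings in your file are therefore invalid, and an invalid mapping can cause arbitrary errors during text extraction.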
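
To check whether a given file is affected, one option is to list the ToUnicode entry of every font referenced from the page resources and report its object type. The following is only a minimal sketch, assuming iTextSharp 5.x; the class name ToUnicodeCheck is made up for illustration, and fonts that are only used inside form XObjects or annotation appearances are not visited.

```csharp
using System;
using iTextSharp.text.pdf;

// Hypothetical helper for illustration; not part of iTextSharp.
class ToUnicodeCheck
{
    static void Main(string[] args)
    {
        PdfReader reader = new PdfReader(args[0]);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Look up /Resources -> /Font on the page dictionary.
                PdfDictionary resources = reader.GetPageN(page).GetAsDict(PdfName.RESOURCES);
                PdfDictionary fonts = resources == null ? null : resources.GetAsDict(PdfName.FONT);
                if (fonts == null)
                    continue;

                foreach (PdfName fontKey in fonts.Keys)
                {
                    PdfDictionary font = fonts.GetAsDict(fontKey);
                    if (font == null)
                        continue;

                    // GetDirectObject resolves an indirect reference, so a valid
                    // entry shows up here as a stream object.
                    PdfObject toUnicode = font.GetDirectObject(PdfName.TOUNICODE);
                    string verdict;
                    if (toUnicode == null)
                        verdict = "no ToUnicode entry";
                    else if (toUnicode.IsStream())
                        verdict = "stream (as the specification requires)";
                    else if (toUnicode.IsName())
                        verdict = "name " + toUnicode + " (invalid, must be a stream)";
                    else
                        verdict = "unexpected object type (invalid, must be a stream)";

                    Console.WriteLine("Page {0}, font {1}: {2}", page, fontKey, verdict);
                }
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

If a font is reported with a name instead of a stream, the file has the defect described above, and whitespace-only output from PdfTextExtractor for text drawn with that font is a plausible consequence.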
