PDFBox throws an error when extracting text that uses the DejaVu Sans Condensed font


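import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

PDDocument document = PDDocument.load(file);
if (document.isEncrypted())
{
    document.setAllSecurityToBeRemoved(false);
}
PDFTextStripper stripper = new PDFTextStripper();
//stripper.setSortByPosition( true );
String text = stripper.getText(document);
System.out.println(text);
// Backslashes in a Java string literal must be escaped.
OutputStreamWriter writer =
    new OutputStreamWriter(new FileOutputStream("C:\\preface.txt"), StandardCharsets.UTF_8);
writer.write(text);
writer.flush();
writer.close();
// Release the resources held by the loaded document.
document.close();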

I am trying to extract text from a PDF file that uses the DejaVu Sans Condensed and DejaVu Sans Condensed-Bold fonts, but PDFBox logs the following error:

SEVERE: Could not read ToUnicode CMap in font DejaVuSansCondensed
java.io.IOException: Error: expected the end of a dictionary.
at org.apache.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:477)
at org.apache.fontbox.cmap.CMapParser.parse(CMapParser.java:112)
at org.apache.pdfbox.pdmodel.font.CMapManager.parseCMap(CMapManager.java:75)
at org.apache.pdfbox.pdmodel.font.PDFont.readCMap(PDFont.java:197)
at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:137)
at org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:176)
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:83)
at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:227)
at Library.main(Library.java:32)
Jun 03, 2018 1:30:59 AM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
INFO: OpenType Layout tables used in font DejaVuSansCondensed are not implemented in PDFBox and will be ignored
Jun 03, 2018 1:30:59 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+98 (98) in font DejaVuSansCondensed
Jun 03, 2018 1:30:59 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+105 (105) in font DejaVuSansCondensed

I also found that this particular set of PDF files has no Unicode mapping. Please help me write a Unicode mapping for this program.
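Before writing any mapping by hand, it may help to know exactly which character codes are affected. The following is only a minimal sketch under that assumption: it scans log lines in the format shown above and collects the CIDs that PDFBox reported as unmapped, grouped by font. The class name `UnmappedCidCollector` and its `collect` method are hypothetical helpers, not part of PDFBox, and the sketch does not itself repair the ToUnicode CMap.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: it only reads log text.
public class UnmappedCidCollector {

    // Matches warnings such as:
    // WARNING: No Unicode mapping for CID+98 (98) in font DejaVuSansCondensed
    private static final Pattern WARNING = Pattern.compile(
            "No Unicode mapping for CID\\+(\\d+) \\(\\d+\\) in font (\\S+)");

    // Returns font name -> sorted set of CIDs that had no Unicode mapping.
    public static Map<String, SortedSet<Integer>> collect(List<String> logLines) {
        Map<String, SortedSet<Integer>> result = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = WARNING.matcher(line);
            if (m.find()) {
                int cid = Integer.parseInt(m.group(1));
                result.computeIfAbsent(m.group(2), k -> new TreeSet<>()).add(cid);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The two warnings quoted in the log above.
        List<String> log = Arrays.asList(
                "WARNING: No Unicode mapping for CID+98 (98) in font DejaVuSansCondensed",
                "WARNING: No Unicode mapping for CID+105 (105) in font DejaVuSansCondensed");
        System.out.println(collect(log)); // prints {DejaVuSansCondensed=[98, 105]}
    }
}
```

The resulting list of CIDs per font is the set of codes a hand-written mapping would have to cover. Which Unicode character each CID stands for cannot be read from the log and would have to be determined from the font or the rendered page.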

P.S. I am new to PDFBox.

I was able to solve this problem by downgrading to PDFBox 2.0.2.
