PDFBox program cannot correctly read non-English characters from a PDF



I replaced my iText program with a PDFBox program to read the PDF file.

            public static void main(String[] args) {  
                // TODO Auto-generated method stub 
                PDDocument     pd;  
                BufferedWriter wr;  
                try {  
                    File input = new File("C:\\test\\ExtractTextFromThis.pdf");    // The PDF file from where you would like to extract  
                    File output = new File("C:\\test\\OutPut.txt");    // The text file where you are going to store the extracted data  

// Loading the document.

                    pd = PDDocument.load(input);  // load document
                    pd.setAllSecurityToBeRemoved(true);  

// Trying to check the language.

                        System.out.println(pd.getDocumentCatalog().getLanguage());
                        PDFTextStripper stripper = new PDFTextStripper("UTF-8");  // Initializing PDFTextStripper Object with UTF-8 encoding.
       //          PDFTextStripperByArea stripper = new PDFTextStripperByArea("UTF-16");
                    // Please provide example for this. In attached document,I want to extract text from rectangle. There are 30 boxes.
                    wr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output), "UTF-8"));  // write the output file as UTF-8 instead of the platform default
                    stripper.writeText(pd, wr);
                    System.out.println(stripper.getText(pd));  
                    String text = stripper.getText(pd);  
                    char[] cArr = text.toCharArray();  
// Here is the problem: the characters printed are not within the Unicode range of the Kannada language.
// Print each character, its integer value, and its hexadecimal value (stopping at the end of the array).
                    for (int i = 1; i < Math.min(130, cArr.length); i++) {  
                        System.out.println(cArr[i] + "\t" + (int) cArr[i] + "\t" + Integer.toHexString(cArr[i]));  
                    }  
                    if (pd != null) {  
                        pd.close();  
                    }  
                    wr.close();  
                } catch (Exception e) {  
                    e.printStackTrace();  
                }  
            }  
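
A small check like the following may help confirm whether the extracted text contains any Kannada characters at all. It is only a sketch: the class and method names (`KannadaCheck`, `countKannada`, `dump`) are made up for illustration, and it uses nothing but the JDK, so it can be fed the string returned by `stripper.getText(pd)`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox or iText.
final class KannadaCheck {

    /** Counts the code points of the text that lie in the Unicode Kannada block. */
    static long countKannada(String text) {
        return text.codePoints()
                   .filter(cp -> Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.KANNADA)
                   .count();
    }

    /** Builds one "character TAB decimal TAB hex" row for each of the first 'limit' code points. */
    static List<String> dump(String text, int limit) {
        List<String> rows = new ArrayList<>();
        text.codePoints().limit(limit).forEach(cp ->
            rows.add(new String(Character.toChars(cp)) + "\t" + cp + "\t" + Integer.toHexString(cp)));
        return rows;
    }

    public static void main(String[] args) {
        // Sample input: two Latin letters followed by two Kannada letters.
        String sample = "ab\u0C95\u0CA8";
        System.out.println("Kannada code points: " + countKannada(sample));
        dump(sample, 130).forEach(System.out::println);
    }
}
```

If the count is zero for a page that visibly contains Kannada text, the extractor most likely returned raw character codes rather than Unicode values.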

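What you seem to think of as a shortcoming of iText is actually a misconception in your algorithm for extracting the page content: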

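You assume that the strings in the content stream are single-byte encoded. They need not be, especially for non-ASCII characters. The translation information (if there is any!) is contained in the font dictionary.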
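
To illustrate the point, here is a minimal sketch that contrasts the two ways of reading the same string bytes. Everything in it is an assumption made for the example: the two-byte code width, the toy code-to-Unicode table, and the names `ToUnicodeSketch`, `decodeSingleByte` and `decodeWithMap`. In a real PDF the code width and the mapping come from the font dictionary (for example its encoding or ToUnicode information), and the mapping may be missing entirely.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Toy model only; real fonts define their own code widths and mappings.
final class ToUnicodeSketch {

    /** Naive approach: treats every byte as one Latin-1 character. */
    static String decodeSingleByte(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    /** Reads two-byte character codes and maps each one through the given code-to-Unicode table. */
    static String decodeWithMap(byte[] raw, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 1 < raw.length; i += 2) {
            int code = ((raw[i] & 0xFF) << 8) | (raw[i + 1] & 0xFF);
            // Codes without a mapping become the Unicode replacement character.
            sb.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed mapping: code 1 is KANNADA LETTER KA, code 2 is KANNADA LETTER NA.
        Map<Integer, String> toy = new HashMap<>();
        toy.put(0x0001, "\u0C95");
        toy.put(0x0002, "\u0CA8");

        byte[] raw = {0x00, 0x01, 0x00, 0x02};
        System.out.println("single-byte length: " + decodeSingleByte(raw).length());
        System.out.println("mapped text: " + decodeWithMap(raw, toy));
    }
}
```

The single-byte reading yields four control characters, while the mapped reading yields two Kannada letters, which is the kind of mismatch described in the question.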

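Furthermore, you assume that all text strings are contained directly in the content stream. This need not be true: the content stream can refer to other objects, which may contain text that your code cannot find.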

You also assume that the Contents entry of the page is a single indirect stream. Actually, it can also be an array of streams.
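
The two shapes of the Contents entry can be modelled with plain Java types, as in the sketch below. This is not iText or PDFBox code: a `byte[]` stands in for a single decoded stream, a `List` of `byte[]` stands in for an array of streams, and the names `PageContentsSketch` and `pageContent` are hypothetical. Inserting a newline after each array element is a choice of this sketch, made so that tokens from adjacent streams do not run together.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Toy model of a page's Contents entry; real libraries use their own object types.
final class PageContentsSketch {

    /** Returns the full page content whether 'contents' is one stream or a list of streams. */
    static byte[] pageContent(Object contents) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (contents instanceof byte[]) {
            byte[] single = (byte[]) contents;
            out.write(single, 0, single.length);
        } else if (contents instanceof List) {
            for (Object part : (List<?>) contents) {
                byte[] bytes = (byte[]) part;
                out.write(bytes, 0, bytes.length);
                out.write('\n'); // keep tokens of adjacent streams apart
            }
        } else if (contents != null) {
            throw new IllegalArgumentException("Unexpected Contents type: " + contents.getClass());
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        List<byte[]> parts = Arrays.asList(
            "BT".getBytes(StandardCharsets.US_ASCII),
            "ET".getBytes(StandardCharsets.US_ASCII));
        System.out.println(new String(pageContent(parts), StandardCharsets.US_ASCII));
    }
}
```

Code that only handles the single-stream case silently loses everything after the first element whenever a page uses the array form.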

I suggest you switch to using iText's text parsing classes in the parser package; these iText classes take all of this, and more, into account.
