Text boxes appear in the wrong position when reading doc and pdf files with Apache POI and Apache PDFBox

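I am trying to read and process .doc, .docx and .pdf files in Java by converting each one to a single string, using Apache POI (for doc and docx) and Apache PDFBox (for pdf).
This works fine until the document contains text boxes. Suppose the layout is: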

Paragraph 1
Text box 1
Paragraph 2
Text box 2
Paragraph 3

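Then the output should be:
Paragraph 1Text box 1Paragraph 2Text box 2Paragraph 3
But the output I get is:
Paragraph 1Paragraph 2Paragraph 3Text box 1Text box 2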

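The text boxes seem to be appended at the end instead of where they belong, between the paragraphs. This happens with both doc and pdf files, so both libraries, POI and PDFBox, show the same problem.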

The code that reads pdf files is:

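void pdf(String file) throws IOException {
    // Initialise the file
    File myFile = new File(file);
    PDDocument pdDoc = null;
    try {
        // Load the PDF
        pdDoc = PDDocument.load(myFile);
        // Create the extractor
        PDFTextStripper pdf = new PDFTextStripper();
        // Extract the text
        output = pdf.getText(pdDoc);
    } finally {
        if (pdDoc != null)
            // Close the document
            pdDoc.close();
    }
}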

The code for doc files is:

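void doc(String file) throws FileNotFoundException, IOException {
    File myFile = null;
    WordExtractor extractor = null;
    // Initialise the file
    myFile = new File(file);
    // Create the file input stream
    FileInputStream fis = new FileInputStream(myFile.getAbsolutePath());
    // Open the document
    HWPFDocument document = new HWPFDocument(fis);
    // Create the extractor
    extractor = new WordExtractor(document);
    // Get the text from the document
    output = extractor.getText();
}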

For PDFBox, do the following: pdf.setSortByPosition(true);
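To see why that flag can help: by default PDFTextStripper emits text in the order it occurs in the PDF content stream, and a file may store its text boxes after the body paragraphs. Sorting by position reorders the text by where it sits on the page. The sketch below only illustrates that idea and is not PDFBox's actual implementation; the Fragment class, the joinByPosition helper and the coordinates are all hypothetical, and y is assumed to be measured from the top of the page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortByPositionSketch {

    /** Hypothetical text fragment: a piece of text plus its page coordinates. */
    static final class Fragment {
        final String text;
        final double x; // distance from the left edge (assumed units)
        final double y; // distance from the top edge (assumed units)

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins fragments in reading order: top to bottom, then left to right. */
    static String joinByPosition(List<Fragment> fragments) {
        // Copy first so the caller's list is left untouched.
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        StringBuilder sb = new StringBuilder();
        for (Fragment f : sorted) {
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stored order: body paragraphs first, text boxes last.
        List<Fragment> storedOrder = Arrays.asList(
                new Fragment("Paragraph 1", 72, 100),
                new Fragment("Paragraph 2", 72, 300),
                new Fragment("Paragraph 3", 72, 500),
                new Fragment("Text box 1", 72, 200),
                new Fragment("Text box 2", 72, 400));
        // Prints the fragments in page order rather than stored order.
        System.out.println(joinByPosition(storedOrder));
    }
}
```

With the made-up coordinates above, the text boxes land between the paragraphs, which is the order the question asks for. Real pages with multiple columns or overlapping boxes are harder, so the result of position sorting should be checked against the actual documents.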

Try the following code for pdf. You can try a similar approach for doc.

void extractPdfTexts(String file) {
    File myFile = new File(file);
    String output;
    try (PDDocument pdDocument = PDDocument.load(myFile)) {
        PDFTextStripper pdfTextStripper = new PDFTextStripper();
        pdfTextStripper.setSortByPosition(true);
        output = pdfTextStripper.getText(pdDocument);
        System.out.println(output);
    } catch (InvalidPasswordException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
