I have been using the Tika parser to extract plain text from '.doc' files:
    public static void main(String[] args) throws Exception {
        ContentHandler handler = new ToHTMLContentHandler();
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        FileInputStream content = new FileInputStream("file.doc");
        parser.parse(content, handler, metadata, context);
        System.out.println(handler.toString());
        String[] metadataNames = metadata.names();
        for (String name : metadataNames) {
            System.out.println(name + " : " + metadata.get(name));
        }
        FileOutputStream outStream = new FileOutputStream("file.doc.txt");
        outStream.write(handler.toString().getBytes());
        outStream.close();
        content.close();
    }
This works for most files, but for one particular file it throws the following exception:
    Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@7c417213
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at com.goarya.app.resumestorage.migration.TikaParser.main(TikaParser.java:29)
    Caused by: java.lang.IllegalArgumentException: The end (7161) must not be before the start (7162)
        at org.apache.poi.hwpf.usermodel.Range.sanityCheckStartEnd(Range.java:208)
        at org.apache.poi.hwpf.usermodel.Range.<init>(Range.java:194)
        at org.apache.poi.hwpf.usermodel.Paragraph.<init>(Paragraph.java:165)
        at org.apache.poi.hwpf.usermodel.Paragraph.newParagraph(Paragraph.java:144)
        at org.apache.poi.hwpf.usermodel.Range.getParagraph(Range.java:766)
        at org.apache.poi.hwpf.extractor.WordExtractor.getParagraphText(WordExtractor.java:168)
        at org.apache.poi.hwpf.extractor.WordExtractor.getMainTextboxText(WordExtractor.java:145)
        at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:183)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:169)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:130)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        ... 3 more
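In the trace, the TikaException is only a wrapper: the "Caused by" entry shows the failure starts as an IllegalArgumentException inside POI's Range class. When many files are processed, it can help to log that innermost cause rather than the wrapper. Below is a minimal sketch using only the JDK; the CauseChain class and its methods are hypothetical helpers (not part of Tika or POI), and a plain Exception stands in for TikaException so the example needs no Tika jars.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Tika or POI.
final class CauseChain {

    // Collects the exception and each nested cause, outermost first.
    static List<Throwable> of(Throwable t) {
        List<Throwable> chain = new ArrayList<>();
        // The contains() check guards against a cyclic cause chain.
        while (t != null && !chain.contains(t)) {
            chain.add(t);
            t = t.getCause();
        }
        return chain;
    }

    // Returns the innermost cause (the exception thrown first), or null for null input.
    static Throwable root(Throwable t) {
        List<Throwable> chain = of(t);
        return chain.isEmpty() ? null : chain.get(chain.size() - 1);
    }

    public static void main(String[] args) {
        // Stand-ins for the two exceptions in the trace above.
        Throwable inner = new IllegalArgumentException(
                "The end (7161) must not be before the start (7162)");
        Throwable outer = new Exception("Unexpected RuntimeException", inner);

        Throwable root = root(outer);
        System.out.println(root.getClass().getName() + ": " + root.getMessage());
    }
}
```

In real code the call would sit in a catch block around parser.parse(...), for example `catch (TikaException e) { log(CauseChain.root(e)); }`, where log is whatever reporting the application already uses.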
The DOC file opens in Microsoft Word without any errors.
Also, Microsoft.Office.Interop.Word in C# extracts the plain text from it.
How can I work around this problem with Apache Tika?
Edit: added a sample document for this scenario.
I am using the tika-core 1.2 jar, and my program runs successfully with the following code.
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.ToHTMLContentHandler;
    import org.xml.sax.SAXException;

    public class Exmple2 {
        public static void main(final String[] args) throws IOException, TikaException, SAXException {
            ToHTMLContentHandler handler = new ToHTMLContentHandler();
            AutoDetectParser parser = new AutoDetectParser();
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();
            FileInputStream content = new FileInputStream("/home/ist/FTRDocuments/taableDis.docx");
            parser.parse(content, handler, metadata, context);
            System.out.println(handler.toString());
            String[] metadataNames = metadata.names();
            for (String name : metadataNames) {
                System.out.println(name + " : " + metadata.get(name));
            }
            FileOutputStream outStream = new FileOutputStream("/home/ist/file.doc.txt");
            outStream.write(handler.toString().getBytes());
            outStream.close();
            content.close();
        }
    }
The only change for Tika 1.2 is that the handler is declared as ToHTMLContentHandler instead of ContentHandler.
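A side note on the output step: both listings write the result with handler.toString().getBytes(), which encodes with the JVM's default charset and leaves the streams open if an exception is thrown earlier. A minimal sketch of a safer write, using only the JDK, is shown below; the TextOutput class is a hypothetical helper, and the string in main is a placeholder for handler.toString().

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of Tika.
final class TextOutput {

    // Writes the text as UTF-8; try-with-resources closes the writer even on failure.
    static void write(Path target, String text) throws IOException {
        try (Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder for the value returned by handler.toString().
        String extracted = "<p>extracted text goes here</p>";
        write(Paths.get("file.doc.txt"), extracted);
    }
}
```

This does not address the parsing exception itself; it only makes the saved file's encoding predictable across machines.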